Configure XGBoost "max_cat_threshold" Parameter

The max_cat_threshold parameter in XGBoost controls the maximum number of categories considered for each split when using categorical features.

This parameter is only applicable when enable_categorical is set to True and categorical features are marked as “category” in a pandas DataFrame.

Adjusting max_cat_threshold can help prevent overfitting in partition-based splits.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
import pandas as pd

# Generate synthetic data (all features are continuous at this point)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=2, n_classes=2, random_state=42)

# Convert to pandas DataFrame, bin two columns into 20 quantile bins each,
# and mark them as categorical (casting raw floats would yield one category per row)
X_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(10)])
X_df['feature_0'] = pd.qcut(X_df['feature_0'], q=20, labels=False).astype('category')
X_df['feature_1'] = pd.qcut(X_df['feature_1'], q=20, labels=False).astype('category')

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.2, random_state=42)

# Initialize the XGBoost classifier with enable_categorical and max_cat_threshold
# (categorical support requires the 'hist' or 'approx' tree method)
model = XGBClassifier(enable_categorical=True, tree_method='hist', max_cat_threshold=10, eval_metric='logloss')

# Fit the model
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)

max_cat_threshold only matters for partition-based splits. XGBoost uses a one-hot style split when a feature has fewer categories than max_cat_to_onehot and a partition-based split otherwise, so the limit has no effect on low-cardinality features. As with the rest of XGBoost's categorical support, it also requires enable_categorical=True and columns with the pandas “category” dtype.

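By default, the scikit-learn wrapper leaves max_cat_threshold unset (None), in which case XGBoost falls back to its built-in default, documented as 64 at the time of writing. Features with no more categories than the limit have all of their categories considered at each split.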

Adjusting the max_cat_threshold parameter can be beneficial when working with datasets containing high-cardinality categorical features (e.g. hundreds or thousands of categories).
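Before tuning, it can help to count the categories in each column and see which features would reach partition-based splits at all. The sketch below uses a small made-up DataFrame; the column names and the onehot_limit value are illustrative assumptions, not XGBoost defaults. It relies on the documented rule that features with fewer categories than max_cat_to_onehot are split one-hot style and the rest are partitioned.

```python
import pandas as pd

# Hypothetical data: one low-cardinality and one higher-cardinality column
df = pd.DataFrame({
    "color": pd.Categorical(["red", "green", "blue", "red", "green", "blue"]),
    "zip_code": pd.Categorical(["10001", "10002", "10003", "10004", "10005", "10006"]),
})

# Number of distinct categories per categorical column
cardinality = df.select_dtypes(include="category").nunique()

# Illustrative stand-in for max_cat_to_onehot (an assumption, not a recommendation)
onehot_limit = 4

# Columns with at least this many categories use partition-based splits,
# which is where max_cat_threshold takes effect
partitioned = cardinality[cardinality >= onehot_limit].index.tolist()

print(cardinality.to_dict())
print(partitioned)
```

On real data, the same check run on the training DataFrame shows which columns are worth considering when adjusting max_cat_threshold.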

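Setting a lower value can help prevent overfitting by limiting the number of categories considered for each split. In a partition-based split (often called optimal partitioning), the categories are sorted by their gradient statistics and then divided into two groups, one for each child node. max_cat_threshold caps how many categories take part in that search, which restricts how finely a single split can carve up a high-cardinality feature.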
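The idea behind a capped partition search can be illustrated in plain Python. This is a deliberately simplified sketch, not XGBoost's implementation: it assumes a hessian of 1 per row, no regularization, a scan in one direction only, and it treats the cap as the maximum number of categories that may be placed in the left group. The function name best_partition and the toy data are hypothetical.

```python
from collections import defaultdict


def best_partition(categories, gradients, max_cat_threshold):
    """Return the categories sent to the left child by a capped partition search."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, g in zip(categories, gradients):
        sums[c] += g
        counts[c] += 1

    # Order categories by their mean gradient
    ordered = sorted(sums, key=lambda c: sums[c] / counts[c])

    total_sum = sum(sums.values())
    total_cnt = sum(counts.values())
    best_gain, best_left = float("-inf"), []
    left_sum, left_cnt = 0.0, 0

    # Assumption: the cap limits how many categories the left group may hold;
    # at least one category must remain on the right
    limit = min(max_cat_threshold, len(ordered) - 1)
    for i in range(limit):
        c = ordered[i]
        left_sum += sums[c]
        left_cnt += counts[c]
        right_sum = total_sum - left_sum
        right_cnt = total_cnt - left_cnt
        # Simplified split score: sum of G^2 / H over both children, with H = row count
        gain = left_sum ** 2 / left_cnt + right_sum ** 2 / right_cnt
        if gain > best_gain:
            best_gain, best_left = gain, ordered[: i + 1]
    return best_left


# Toy data: four categories with clearly separated mean gradients
cats = ["a", "a", "b", "c", "d", "d"]
grads = [-2.0, -2.0, -1.0, 1.0, 2.0, 2.0]

print(best_partition(cats, grads, max_cat_threshold=3))  # ['a', 'b']
print(best_partition(cats, grads, max_cat_threshold=1))  # ['a']
```

With a cap of 3 the search can reach the balanced split that groups "a" and "b" together; with a cap of 1 it is forced to stop at a coarser split, which is the regularizing effect described above.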

When tuning the max_cat_threshold parameter, it is recommended to start with a large value and adjust it down based on model performance and validation metrics. Cross-validation can be used to find the optimal value that balances model performance and overfitting prevention. It is essential to consider the trade-off between model complexity and interpretability when setting the parameter.

Experimentation and validation are crucial in finding the optimal max_cat_threshold value for a specific use case. The exact relationship between the parameter value and the model’s performance may vary depending on the dataset and problem domain.

See Also