When using XGBoost with categorical features, setting enable_categorical=True
allows the model to handle categorical data directly.
Two important parameters come into play: max_cat_threshold and max_cat_to_onehot.
This example compares and contrasts these parameters and provides a code snippet demonstrating their usage.
import time
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# Generate a synthetic dataset with high cardinality categorical features
X, y = make_classification(n_samples=100000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=2, n_clusters_per_class=2,
                           weights=[0.8, 0.2], flip_y=0.01, class_sep=1.0,
                           hypercube=True, shift=0.0, scale=1.0, shuffle=True,
                           random_state=42)
# Convert to DataFrame and modify categorical features
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
cat_features = ['feature_0', 'feature_1', 'feature_2', 'feature_3']
for feat in cat_features:
    # Simulate high cardinality: scale, round to integers, and treat each integer as a category
    df[feat] = (df[feat] * 1000).round().astype(int).astype('category')
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=42)
# Model configurations
configurations = [
("Default", {}),
("Max Cat Threshold", {"max_cat_threshold": 10}),
("Max Cat To Onehot", {"max_cat_to_onehot": 10})
]
# Evaluate models
results = []
for name, params in configurations:
    model = XGBClassifier(enable_categorical=True, eval_metric='logloss', random_state=42, **params)
    start_time = time.time()
    model.fit(X_train, y_train)
    fit_time = time.time() - start_time
    accuracy = model.score(X_test, y_test)
    results.append((name, accuracy, fit_time))
# Print results
for result in results:
    print(f"Model: {result[0]}, Accuracy: {result[1]:.2f}, Fit Time: {result[2]:.2f} seconds")
The max_cat_threshold parameter sets the maximum number of categories considered for each split. It is used only by partition-based splits, where it helps prevent overfitting.
On the other hand, max_cat_to_onehot is a threshold for deciding whether XGBoost uses a one-hot encoding based split for a categorical feature. When the number of categories in a feature is less than max_cat_to_onehot, one-hot (one-vs-rest) splits are used; otherwise the categories are partitioned into two groups, one for each child node.
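The threshold rule for max_cat_to_onehot can be sketched in plain Python. This is only an illustration of the rule as described above: split_strategy is a hypothetical helper, not part of the XGBoost API, and the toy DataFrame and its column names are made up for the example.

```python
import pandas as pd


def split_strategy(n_categories: int, max_cat_to_onehot: int) -> str:
    """Hypothetical helper mirroring the documented rule; not an XGBoost API."""
    # Fewer categories than the threshold: one-hot (one-vs-rest) splits.
    if n_categories < max_cat_to_onehot:
        return "one-hot"
    # Otherwise the categories are partitioned into two groups per split.
    return "partition"


# Toy frame: one low-cardinality and one higher-cardinality categorical column.
df = pd.DataFrame({
    "color": pd.Categorical(["red", "green", "blue"] * 4),
    "zip_code": pd.Categorical([f"z{i}" for i in range(12)]),
})

threshold = 10  # same value as the "Max Cat To Onehot" configuration
strategies = {
    col: split_strategy(len(df[col].cat.categories), threshold)
    for col in df.columns
}
print(strategies)  # {'color': 'one-hot', 'zip_code': 'partition'}
```

Running a check like this on your own categorical columns shows which of them a given max_cat_to_onehot value would actually affect.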
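The example compares a default model that leaves both parameters at their default values, a model that considers at most 10 categories for each partition-based split, and another that raises the one-hot threshold to 10. Because the synthetic categorical features have far more than 10 categories each, they stay above that threshold and are still handled by partition-based splits; one-hot splits apply only to features with fewer categories than the threshold.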
Tuning the splits via max_cat_threshold and max_cat_to_onehot can cause the model to take longer to fit, although it will likely result in smaller trees that use less memory.