The max_cat_to_onehot parameter in XGBoost controls how the algorithm splits on categorical features.
It sets a threshold on the number of categories: a categorical feature with fewer categories than the threshold is split using one-hot encoding.
To use this parameter, you must set enable_categorical to True and use a pandas DataFrame whose categorical columns have the category dtype.
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# Generate synthetic data
data = {
    'numerical_1': [1, 2, 3, 4, 5] * 20,
    'numerical_2': [10, 20, 30, 40, 50] * 20,
    'categorical_1': ['A', 'B', 'C', 'D', 'E'] * 20,
    'categorical_2': ['X', 'Y', 'Z'] * 33 + ['X'],
    'target': [0, 1] * 50
}
df = pd.DataFrame(data)
# Mark categorical columns
df['categorical_1'] = df['categorical_1'].astype('category')
df['categorical_2'] = df['categorical_2'].astype('category')
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42)
# Initialize the XGBoost classifier with enable_categorical and max_cat_to_onehot
model = XGBClassifier(enable_categorical=True, max_cat_to_onehot=3, eval_metric='logloss')
# Fit the model
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
The max_cat_to_onehot parameter specifies the threshold for one-hot encoding based splits: a categorical feature with fewer categories than this value is split one category at a time.
Otherwise, XGBoost instead groups the categories and seeks an optimal partitioning of the groups based on leaf values.
This can be useful for high-cardinality categorical features: a one-hot split separates only one category at a time, so a feature with many categories can need many splits.
You might want to adjust this parameter when you have categorical features with many unique categories and want fewer splits.
Setting max_cat_to_onehot=1 will maximally focus on optimal partitioning: no feature falls below the threshold, so every categorical split divides the categories of a variable into two “optimal” groups.
A value of max_cat_to_onehot larger than the number of categories will cause XGBoost to use one-hot encoding (each split isolates a single category from the rest) and, in turn, not seek optimal groupings of categories.
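The threshold rule can be sketched in plain Python. The helper split_strategy below is hypothetical and is not part of XGBoost. It assumes the comparison is a strict "fewer than", which is how the XGBoost parameter documentation describes it.

```python
def split_strategy(n_categories: int, max_cat_to_onehot: int) -> str:
    """Hypothetical helper mirroring the documented threshold rule.

    Assumption: one-hot splits are used only when the number of
    categories is strictly less than max_cat_to_onehot.
    """
    if n_categories < max_cat_to_onehot:
        return "onehot"
    return "partition"


# Category counts taken from the synthetic example: 5 and 3 categories.
features = {"categorical_1": 5, "categorical_2": 3}

for threshold in (1, 4, 10):
    choices = {name: split_strategy(n, threshold) for name, n in features.items()}
    print(threshold, choices)
```

With a threshold of 1 both features are partitioned, with 10 both use one-hot splits, and with 4 only the three-category feature uses one-hot splits.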
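By default, max_cat_to_onehot is left unset (None) in the scikit-learn interface, in which case XGBoost falls back to its built-in default threshold.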