XGBoost Compare "max_cat_threshold" vs "max_cat_to_onehot" Parameters

When using XGBoost with categorical features, setting enable_categorical=True allows the model to handle categorical data directly.

Two important parameters come into play: max_cat_threshold and max_cat_to_onehot.

This example compares and contrasts these parameters and provides a code snippet demonstrating their usage.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
import time

# Generate a synthetic dataset with high cardinality categorical features
X, y = make_classification(n_samples=100000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=2, n_clusters_per_class=2,
                           weights=[0.8, 0.2], flip_y=0.01, class_sep=1.0,
                           hypercube=True, shift=0.0, scale=1.0, shuffle=True,
                           random_state=42)

# Convert to DataFrame and modify categorical features
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
cat_features = ['feature_0', 'feature_1', 'feature_2', 'feature_3']
for feat in cat_features:
    # Simulate high cardinality: scale, round, and cast to int so each column has many distinct categories
    df[feat] = (df[feat] * 1000).round().astype(int).astype('category')

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=42)

# Model configurations
configurations = [
    ("Default", {}),
    ("Max Cat Threshold", {"max_cat_threshold": 10}),
    ("Max Cat To Onehot", {"max_cat_to_onehot": 10})
]

# Evaluate models
results = []
for name, params in configurations:
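    # tree_method='hist' is set explicitly because categorical support requires hist or approx
    model = XGBClassifier(enable_categorical=True, tree_method='hist', eval_metric='logloss', random_state=42, **params)

    # perf_counter is a monotonic clock intended for timing intervals
    start_time = time.perf_counter()
    model.fit(X_train, y_train)
    fit_time = time.perf_counter() - start_time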

    accuracy = model.score(X_test, y_test)

    results.append((name, accuracy, fit_time))

# Print results
for result in results:
    print(f"Model: {result[0]}, Accuracy: {result[1]:.2f}, Fit Time: {result[2]:.2f} seconds")

The max_cat_threshold parameter sets the maximum number of categories considered for each split and is used only by partition-based splits to prevent overfitting.

On the other hand, max_cat_to_onehot is a threshold for deciding whether XGBoost uses a one-hot encoding based split for a categorical feature. When the number of categories in a feature is less than max_cat_to_onehot, one-hot (one-vs-rest) splits are used; otherwise the categories are partitioned into two groups that go to the child nodes.
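The rule described above can be sketched in a few lines of plain Python. This is only an illustration of the documented behavior, not XGBoost's internal code; the function name categorical_split_strategy is hypothetical.

```python
def categorical_split_strategy(n_categories: int, max_cat_to_onehot: int) -> str:
    """Return the split style the documented threshold rule would select.

    Hypothetical helper for illustration; it mirrors the documented rule
    (one-hot when the category count is below the threshold) and is not
    part of the XGBoost API.
    """
    if n_categories < max_cat_to_onehot:
        return "one-hot"
    return "partition"


# A feature with 3 categories falls under a threshold of 10.
print(categorical_split_strategy(3, 10))
# A feature with 5000 categories does not, so its categories are partitioned.
print(categorical_split_strategy(5000, 10))
# The comparison is strict: a count equal to the threshold is partitioned.
print(categorical_split_strategy(10, 10))
```

Read this way, max_cat_threshold only matters for features that land in the "partition" branch.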

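The example compares three models: one with the default settings, one that caps partition-based splits at 10 categories via max_cat_threshold, and one that sets max_cat_to_onehot to 10. Because every categorical feature in this dataset has far more than 10 categories, the third model still uses partition-based splits; one-hot splits apply only to features with fewer categories than the threshold.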
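Both thresholds are compared against the number of categories in a feature, so it can help to check each column's cardinality before choosing values. The snippet below is a minimal sketch on a small toy frame (the column names low_card and high_card are made up for illustration); the same nunique() call should work on df[cat_features] from the example above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy frame with one low-cardinality and one high-cardinality categorical column.
toy = pd.DataFrame({
    "low_card": pd.Categorical(rng.integers(0, 5, size=1_000)),
    "high_card": pd.Categorical(rng.integers(0, 500, size=1_000)),
    "numeric": rng.normal(size=1_000),
})

# Select only the categorical columns and count distinct values in each.
cat_cols = toy.select_dtypes(include="category").columns
cardinality = toy[cat_cols].nunique()
print(cardinality)

# Columns whose cardinality is below a candidate max_cat_to_onehot value
# would be eligible for one-hot splits under the documented rule.
candidate_threshold = 10
onehot_candidates = cardinality[cardinality < candidate_threshold].index.tolist()
print(onehot_candidates)
```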

Tuning max_cat_threshold and max_cat_to_onehot can make fitting take longer, although it will likely result in smaller trees that use less memory. The effect depends on the data, so compare fit times and model sizes directly rather than assuming a direction.
