When working with categorical features in machine learning, encoding them into numerical representations is a crucial preprocessing step.
XGBoost, a popular gradient boosting library, offers built-in support for handling categorical variables efficiently.
In this example, we’ll compare the execution time of manual encoding techniques (one-hot encoding and ordinal encoding) with XGBoost’s native categorical support, enabled with the enable_categorical parameter, and show that the native approach is faster than the manual encoding schemes on a synthetic dataset.
Let’s generate a synthetic dataset with categorical features using scikit-learn and measure the execution times of different encoding approaches.
from sklearn.datasets import make_classification
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from xgboost import XGBClassifier
import numpy as np
import pandas as pd
import time
# Generate a synthetic dataset with categorical features
X, y = make_classification(n_samples=1000000, n_features=10, n_informative=5, n_redundant=0, n_classes=2, random_state=42)
# Add 5 categorical features
X_cat = np.random.choice(['A', 'B', 'C', 'D'], size=(X.shape[0], 5))
X_combined = np.hstack((X, X_cat))
# Prepare dataframe with categorical type
X_df = pd.DataFrame(X_combined, columns=[f'feature_{i}' for i in range(15)])
# Set data type for categorical variables
categorical_features = [f'feature_{i}' for i in range(10, 15)]
for feature in [f'feature_{i}' for i in range(15)]:
    if feature in categorical_features:
        X_df[feature] = X_df[feature].astype('category')
    else:
        X_df[feature] = X_df[feature].astype('float')
# One-hot encoding
start_time = time.perf_counter()
transformer_ohe = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), categorical_features)], remainder='passthrough')
X_ohe = transformer_ohe.fit_transform(X_df)
xgb_ohe = XGBClassifier(n_estimators=100)
xgb_ohe.fit(X_ohe, y)
ohe_time = time.perf_counter() - start_time
print(f"One-hot encoding time: {ohe_time:.4f} seconds")
# Ordinal encoding
start_time = time.perf_counter()
transformer_oe = ColumnTransformer(transformers=[('encoder', OrdinalEncoder(), categorical_features)], remainder='passthrough')
X_oe = transformer_oe.fit_transform(X_df)
xgb_oe = XGBClassifier(n_estimators=100)
xgb_oe.fit(X_oe, y)
oe_time = time.perf_counter() - start_time
print(f"Ordinal encoding time: {oe_time:.4f} seconds")
# XGBoost with enable_categorical
start_time = time.perf_counter()
xgb_clf = XGBClassifier(n_estimators=100, enable_categorical=True)
xgb_clf.fit(X_df, y)
xgb_cat_time = time.perf_counter() - start_time
print(f"XGBoost native time: {xgb_cat_time:.4f} seconds")
In this example, we first generate a synthetic dataset using scikit-learn’s make_classification function, creating 1,000,000 samples with 10 numerical features, 5 of which are informative. We then add 5 categorical features, each drawn from four levels (A to D), using np.random.choice, simulating a real-world scenario where both numerical and categorical features are present.
We then wrap the dataset in a pandas DataFrame and set the appropriate data types: category for the categorical columns, which XGBoost’s native categorical support requires, and float for the numerical columns.
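As a small illustration of the dtype conversion described above, the following sketch builds a toy DataFrame (the column names num and cat are made up for this example) and shows what pandas stores once a string column is cast to the category dtype. It is only meant to show what the conversion does, not to reproduce the benchmark data.

```python
import pandas as pd

# Toy frame with one numerical and one string column (hypothetical names).
df = pd.DataFrame({
    "num": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "cat": ["A", "B", "C", "D", "A", "B"],
})

# Cast the string column to the pandas category dtype.
df["cat"] = df["cat"].astype("category")

categories = df["cat"].cat.categories.tolist()  # distinct levels, sorted
codes = df["cat"].cat.codes.tolist()            # integer code for each row

print(df.dtypes)
print(categories)
print(codes)
```

The category dtype keeps the distinct levels once and stores each row as a small integer code, which is the information a library needs to treat the column as categorical rather than as free text.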
Next, we apply one-hot encoding to the categorical features using scikit-learn’s OneHotEncoder, train an XGBoost classifier on the result, and measure the combined execution time. Similarly, we apply ordinal encoding using OrdinalEncoder, train a second classifier, and measure the combined encoding and training time.
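To make the difference between the two manual encodings concrete, here is a minimal sketch on a single toy column with the same four levels used in the benchmark. The variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One categorical column with four levels, shaped (n_samples, 1).
col = np.array([["A"], ["B"], ["C"], ["D"], ["A"]])

# One-hot encoding: one binary column per level (sparse by default,
# so convert to a dense array for inspection).
one_hot = OneHotEncoder().fit_transform(col).toarray()

# Ordinal encoding: a single column of level indices stored as floats.
ordinal = OrdinalEncoder().fit_transform(col)

print(one_hot.shape)    # (5, 4)
print(ordinal.ravel())  # [0. 1. 2. 3. 0.]
```

Applied to the benchmark data, one-hot encoding turns the 5 categorical features with four levels each into 20 binary columns (30 columns in total with the 10 numerical ones), while ordinal encoding keeps the original 15 columns but imposes an arbitrary order on the levels.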
Finally, we train an XGBoost classifier using XGBClassifier with the enable_categorical parameter set to True. This allows XGBoost to handle the categorical features natively without the need for manual encoding. We measure the execution time of the XGBoost training process, which includes the internal categorical encoding.
The output of this code will display the execution times for each encoding approach. For example:
One-hot encoding time: 5.0626 seconds
Ordinal encoding time: 4.3550 seconds
XGBoost native time: 3.2768 seconds
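Although the exact times will vary with your system and library versions, you should observe that XGBoost’s native categorical handling is faster than manual one-hot encoding and ordinal encoding. In the sample run above, the native approach took about 25% less time than ordinal encoding and about 35% less time than one-hot encoding.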
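Because a single timed run can be noisy, one option is to repeat each measurement and compare medians. The helper below is a hypothetical sketch (time_call and workload are not part of the original example), and the workload is a stand-in for a function that encodes the data and fits a classifier.

```python
import statistics
import time


def time_call(fn, repeats=3):
    """Run fn() `repeats` times and return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def workload():
    # Stand-in for "encode the features and fit the model".
    return sum(i * i for i in range(100_000))


durations = time_call(workload, repeats=5)
print(f"median: {statistics.median(durations):.4f} s, min: {min(durations):.4f} s")
```

In the benchmark, each of the three approaches could be wrapped in its own function and passed to time_call, so that the comparison rests on several runs rather than one.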
The benefit of using XGBoost’s enable_categorical parameter is that it handles the encoding internally, which simplifies the preprocessing pipeline and eliminates the need for additional encoding steps.
By leveraging XGBoost’s efficient categorical encoding, you can streamline your data preprocessing workflow and focus on other aspects of your machine learning pipeline. This is particularly valuable when working with large datasets or when you need to quickly iterate and experiment with different models and features.