The enable_categorical parameter in XGBoost allows for native handling of categorical features.
By first setting the dtype of the relevant columns to 'category' in your pandas DataFrame and then setting enable_categorical to True when initializing the XGBoost model, you can skip manual encoding steps, streamline your data preparation process, and improve the efficiency of your workflow.
# XGBoosting.com
# Configure XGBoost "enable_categorical" Parameter
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(20)])
# Convert a subset of columns to categorical
categorical_features = ['feature_5', 'feature_7', 'feature_13']
for feature in categorical_features:
X[feature] = pd.cut(X[feature], bins=4, labels=False).astype('category')
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the XGBoost classifier with enable_categorical=True
# (the target is binary, so 'logloss' is the matching evaluation metric)
model = XGBClassifier(enable_categorical=True, eval_metric='logloss')
# Fit the model
model.fit(X_train, y_train)
# Make predictions on the test set
predictions = model.predict(X_test)
# Display the first few predictions
print(predictions[:10])
In this example, we generate a synthetic dataset using make_classification from scikit-learn. We then convert the dataset into a pandas DataFrame for easy manipulation. To demonstrate the usage of enable_categorical, we convert a subset of columns ('feature_5', 'feature_7', and 'feature_13') into categorical variables by binning them with pd.cut() and setting the data type to 'category'.
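To make the binning step concrete, here is a minimal sketch of what pd.cut() followed by astype('category') produces on a single column. The toy values are hypothetical stand-ins for one of the synthetic features, not data from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for one of the synthetic features
rng = np.random.default_rng(42)
values = pd.Series(rng.normal(size=8), name="feature_5")

# labels=False makes pd.cut return the integer bin index (0 to 3) for each row
bin_codes = pd.cut(values, bins=4, labels=False)

# Casting to 'category' marks the column as categorical for pandas
as_category = bin_codes.astype("category")

print(bin_codes.dtype)                     # an integer dtype
print(as_category.dtype)                   # category
print(list(as_category.cat.categories))    # the bin indices that actually occur
```

Only the bins that contain at least one value become category levels, so a column can end up with fewer than four categories.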
Next, we split the dataset into training and testing sets using train_test_split. We initialize an XGBoost classifier (XGBClassifier) with enable_categorical=True.
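Because the native handling depends on the columns carrying the 'category' dtype, it can be worth checking which columns have that dtype before fitting. The sketch below is one possible way to do this; the helper categorical_columns and the frame X_demo are hypothetical names introduced for illustration only.

```python
import pandas as pd


def categorical_columns(frame: pd.DataFrame) -> list[str]:
    """Return the names of columns whose dtype is 'category'."""
    return list(frame.select_dtypes(include="category").columns)


# Hypothetical miniature of the tutorial's X: two numeric columns, one binned column
X_demo = pd.DataFrame({
    "feature_0": [0.1, -1.2, 0.7, 2.3],
    "feature_5": pd.Series([0, 3, 1, 3]).astype("category"),
    "feature_6": [1.5, 0.2, -0.4, 0.9],
})

print(categorical_columns(X_demo))

# Row slicing, as a train/test split does, keeps the dtype intact
print(categorical_columns(X_demo.iloc[:2]))
```

Running the same check on X_train and X_test from the example should list the three converted features in both frames.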
We fit the model on the training data using model.fit() and make predictions on the test set using model.predict(). Finally, we display the first few predictions to verify that the model has been trained and can generate predictions.
With enable_categorical=True, XGBoost treats columns of dtype 'category' as categorical and splits on their category values directly, rather than requiring a separate encoding step beforehand. This simplifies data preprocessing and allows XGBoost to learn effectively from datasets containing a mix of numeric and categorical variables.
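To illustrate the preprocessing that native handling lets you skip, the sketch below compares a manually one-hot encoded frame with the frame you would pass when relying on the 'category' dtype. The column names and values are hypothetical, and the snippet uses only pandas, so it shows the shape of the data rather than any XGBoost behavior.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one categorical column
df = pd.DataFrame({
    "amount": [10.5, 3.2, 7.8, 1.1],
    "color": pd.Series(["red", "green", "blue", "green"], dtype="category"),
})

# Manual route: one-hot encoding adds one indicator column per category level
one_hot = pd.get_dummies(df, columns=["color"])

# Native route: the frame is left as-is; the dtype alone marks the column as categorical
native = df

print(list(one_hot.columns))     # amount plus three indicator columns
print(one_hot.shape[1])          # 4
print(native.shape[1])           # 2
print(native.dtypes["color"])    # category
```

With many category levels the one-hot frame grows by one column per level, while the native frame keeps a single column per feature.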