XGBoost, like most machine learning algorithms, requires numerical input data.
When working with datasets containing categorical variables, these features need to be converted into a numerical representation before training the model.
One common approach is to create dummy variables (also known as one-hot encoding), where each unique category value is converted into a new binary feature column.
This example demonstrates how to efficiently convert categorical variables into dummy variables using the get_dummies() function from the pandas library before training an XGBoost model.
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# Create a synthetic dataset with categorical features
data = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
        'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small'],
        'Target': [0, 1, 0, 1, 0]}
df = pd.DataFrame(data)
# Convert categorical features to dummy variables
df_encoded = pd.get_dummies(df, columns=['Color', 'Size'])
# Split the data into features and target
X = df_encoded.drop('Target', axis=1)
y = df_encoded['Target']
# Initialize and train an XGBoost model
model = XGBClassifier(random_state=42)
model.fit(X, y)
Here’s how it works:
We create a small synthetic dataset df with categorical features ‘Color’ and ‘Size’, along with a binary ‘Target’ column. We use pandas’ get_dummies() function to convert the specified categorical columns into dummy variables, which creates a new binary feature column for each unique category value. We split the encoded dataset into features X (dropping the ‘Target’ column) and target variable y. We initialize an XGBoost classifier with a fixed random seed for reproducibility and train it on the encoded features and target.
To make a prediction on new data, we first convert the categorical variables in the new data to dummy variables using get_dummies(), ensuring it has the same columns as the training data. Then, we pass the encoded new data to the trained model for prediction.
By using pandas’ get_dummies() function, we can quickly convert categorical variables into the required format for training an XGBoost model, enabling the algorithm to effectively utilize the information contained within these features.
A limitation of this approach is that the get_dummies() function cannot remember how to map categories to binary vectors. This is a problem when we want to consistently encode new data the same way we did for the training data.
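The limitation is easy to see by comparing the columns get_dummies() produces for the training data versus a single new row (the small frames here are illustrative, not from the original code):

```python
import pandas as pd

train = pd.DataFrame({'Color': ['Red', 'Blue', 'Green'],
                      'Size': ['Small', 'Medium', 'Large']})
new = pd.DataFrame({'Color': ['Green'], 'Size': ['Small']})

# get_dummies() encodes only the categories present in each frame
train_cols = pd.get_dummies(train).columns
new_cols = pd.get_dummies(new).columns

print(list(train_cols))  # ['Color_Blue', 'Color_Green', 'Color_Red', 'Size_Large', 'Size_Medium', 'Size_Small']
print(list(new_cols))    # ['Color_Green', 'Size_Small']
```

Because the new data yields only two columns instead of six, it cannot be passed directly to a model trained on the full set of dummies.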
The solution is to use scikit-learn’s OneHotEncoder, which will remember how to map categorical variables to a dummy variable encoding.
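A minimal sketch of that approach: the encoder is fit once on the training data, stores the category-to-column mapping, and applies the same mapping to any new data (handle_unknown='ignore' makes unseen categories encode as all zeros rather than raising an error):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Color': ['Red', 'Blue', 'Green'],
                      'Size': ['Small', 'Medium', 'Large']})

# Fit once on the training data; the category mapping is stored on the encoder
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['Color', 'Size']])

# New data is encoded into the same six columns, in the same order
new = pd.DataFrame({'Color': ['Green'], 'Size': ['Small']})
encoded = encoder.transform(new[['Color', 'Size']]).toarray()
```

The fitted encoder can be saved alongside the model (e.g. with pickle) so that the exact same encoding is applied at prediction time.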