XGBoost, like most machine learning algorithms, requires numerical input data.
When working with datasets containing categorical variables, these features need to be converted into a numerical representation before training the model.
One common approach is to create dummy variables (also known as one-hot encoding), where each unique category value is converted into a new binary feature column.
This example demonstrates how to efficiently convert categorical variables into dummy variables using the get_dummies() function from the pandas library before training an XGBoost model.
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# Create a synthetic dataset with categorical features
data = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
        'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small'],
        'Target': [0, 1, 0, 1, 0]}
df = pd.DataFrame(data)
# Convert categorical features to dummy variables
df_encoded = pd.get_dummies(df, columns=['Color', 'Size'])
# Split the data into features and target
X = df_encoded.drop('Target', axis=1)
y = df_encoded['Target']
# Initialize and train an XGBoost model
model = XGBClassifier(random_state=42)
model.fit(X, y)
Here’s how it works:
We create a small synthetic dataset df with categorical features ‘Color’ and ‘Size’, along with a binary ‘Target’ column. We use pandas’ get_dummies() function to convert the specified categorical columns into dummy variables, which creates a new binary feature column for each unique category value. We split the encoded dataset into features X (dropping the ‘Target’ column) and target variable y. We initialize an XGBoost classifier with a fixed random seed for reproducibility and train it on the encoded features and target.
To make a prediction on new data, we first convert the categorical variables in the new data to dummy variables using get_dummies(), ensuring it has the same columns as the training data. Then, we pass the encoded new data to the trained model for prediction.
By using pandas’ get_dummies() function, we can quickly convert categorical variables into the required format for training an XGBoost model, enabling the algorithm to effectively utilize the information contained within these features.
A limitation of this approach is that the get_dummies() function cannot remember how to map categories to binary vectors. This is a problem when we want to consistently encode new data the same way we did for the training data.
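The limitation is easy to see by comparing the columns get_dummies() produces for the training data versus a single new row (the small frames here are illustrative, not from the original code):

```python
import pandas as pd

train = pd.DataFrame({'Color': ['Red', 'Blue', 'Green'],
                      'Size': ['Small', 'Medium', 'Large']})
new = pd.DataFrame({'Color': ['Green'], 'Size': ['Small']})

# get_dummies() encodes only the categories present in each frame
train_cols = pd.get_dummies(train).columns
new_cols = pd.get_dummies(new).columns

print(list(train_cols))  # ['Color_Blue', 'Color_Green', 'Color_Red', 'Size_Large', 'Size_Medium', 'Size_Small']
print(list(new_cols))    # ['Color_Green', 'Size_Small']
```

Because the new data yields only two columns instead of six, it cannot be passed directly to a model trained on the full set of dummies.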
The solution is to use scikit-learn’s OneHotEncoder, which will remember how to map categorical variables to a dummy variable encoding.
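A minimal sketch of that approach: the encoder is fit once on the training data, stores the category-to-column mapping, and applies the same mapping to any new data (handle_unknown='ignore' makes unseen categories encode as all zeros rather than raising an error):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Color': ['Red', 'Blue', 'Green'],
                      'Size': ['Small', 'Medium', 'Large']})

# Fit once on the training data; the category mapping is stored on the encoder
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['Color', 'Size']])

# New data is encoded into the same six columns, in the same order
new = pd.DataFrame({'Color': ['Green'], 'Size': ['Small']})
encoded = encoder.transform(new[['Color', 'Size']]).toarray()
```

The fitted encoder can be saved alongside the model (e.g. with pickle) so that the exact same encoding is applied at prediction time.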