XGBoost requires all input features to be numeric.
When working with categorical features, you need to convert them to integers using a technique like label encoding.
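If you're curious what goes wrong when you skip the conversion, here is a quick sketch (for illustration only; the exact error message depends on your XGBoost version):

import pandas as pd
from xgboost import XGBClassifier

raw = pd.DataFrame({'color': ['red', 'blue', 'green'], 'target': [0, 1, 1]})
try:
    # Feeding a raw string column straight into XGBoost fails,
    # typically with a ValueError about non-numeric dtypes.
    XGBClassifier().fit(raw[['color']], raw['target'])
except Exception as err:
    print("XGBoost rejected the string feature:", err)

With that in mind, here is the full label-encoding workflow: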
from sklearn.preprocessing import LabelEncoder
import pandas as pd
from xgboost import XGBClassifier
# Create a synthetic dataset
data = {'color': ['red', 'blue', 'green', 'red', 'green'],
        'size': ['small', 'large', 'medium', 'medium', 'small'],
        'target': [0, 1, 1, 0, 0]}
df = pd.DataFrame(data)
# Initialize LabelEncoder
le_color = LabelEncoder()
le_size = LabelEncoder()
# Encode categorical features
df['color_encoded'] = le_color.fit_transform(df['color'])
df['size_encoded'] = le_size.fit_transform(df['size'])
# Prepare data for XGBoost
X = df[['color_encoded', 'size_encoded']].values
y = df['target'].values
# Initialize and train XGBClassifier
model = XGBClassifier(random_state=42)
model.fit(X, y)
# New data for prediction
new_data = {'color': ['blue', 'red', 'green'],
            'size': ['medium', 'small', 'large']}
new_df = pd.DataFrame(new_data)
# Encode new data
new_df['color_encoded'] = le_color.transform(new_df['color'])
new_df['size_encoded'] = le_size.transform(new_df['size'])
# Make predictions
X_new = new_df[['color_encoded', 'size_encoded']].values
predictions = model.predict(X_new)
# Print results
print("New Data:")
print(new_df)
print("\nPredictions:", predictions)
Here’s what’s happening in the main example, step by step:
- We create a synthetic dataset df with two categorical features, color and size, along with a binary target variable.
- We initialize a LabelEncoder instance for each categorical feature.
- We use fit_transform() to encode the categorical features, creating new columns color_encoded and size_encoded in the dataframe.
- We prepare the encoded features and target variable as numpy arrays X and y for XGBoost.
- We initialize an XGBClassifier with a random_state for reproducibility and train it using fit().
- For making predictions, we create a new dataframe new_df with categorical features in the same format as the training data.
- We use the same LabelEncoder instances to transform the categorical features in new_df, ensuring consistency with the encoding used during training (see the sketch after this list for handling categories the encoders have never seen).
- We prepare the encoded features from new_df as a numpy array X_new and use the trained model to make predictions.
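One practical caveat: LabelEncoder.transform() raises a ValueError when it meets a category that was not present during fit. Below is a minimal sketch of one way to guard against that; the safe_transform helper and the -1 sentinel are illustrative choices rather than part of the original example, and since the model never saw that sentinel during training, predictions for such rows should be treated with caution.

import numpy as np

def safe_transform(encoder, values, unknown_code=-1):
    # Map values the encoder has already seen to their learned codes,
    # and anything unseen to an arbitrary sentinel (here -1).
    known = set(encoder.classes_)
    return np.array([encoder.transform([v])[0] if v in known else unknown_code
                     for v in values])

print(safe_transform(le_color, ['blue', 'purple']))  # 'purple' was never seen -> -1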
By using LabelEncoder, we convert the categorical features into integers that XGBoost can handle. It’s important to use the same LabelEncoder instances for encoding the new data to ensure the same integer mapping is used as during training.
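In a real project, training and prediction usually happen in separate scripts, so you would typically persist the fitted encoders alongside the model to guarantee the same mapping is reused. Here is a small sketch using joblib (the file names are arbitrary examples):

import joblib

# Save the fitted encoders and the trained model together
joblib.dump({'color': le_color, 'size': le_size}, 'encoders.joblib')
model.save_model('xgb_model.json')

# Later, in a separate prediction script
encoders = joblib.load('encoders.joblib')
restored = XGBClassifier()
restored.load_model('xgb_model.json')
codes = encoders['color'].transform(['red'])  # same integer mapping as at training time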