XGBoost requires the target variable to be numerical.
When working with categorical target variables, they must be converted to integers before training an XGBoost model.
scikit-learn’s LabelEncoder provides a simple and efficient way to perform this integer encoding.
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
import numpy as np
# Synthetic feature matrix X
X = np.array([[2.5, 1.0, 3.0],
              [5.0, 2.0, 4.0],
              [3.0, 1.5, 3.5],
              [1.0, 0.5, 2.0],
              [4.5, 1.8, 4.2],
              [2.8, 1.2, 3.2]])
# Example categorical target variable
y = ['cat', 'dog', 'cat', 'bird', 'dog', 'cat']
# Initialize LabelEncoder
le = LabelEncoder()
# Fit and transform the target variable
y_encoded = le.fit_transform(y)
# Initialize and train XGBoost model
model = XGBClassifier(random_state=42)
model.fit(X, y_encoded)
# New data for prediction
X_new = np.array([[3.2, 1.3, 3.4],
                  [1.5, 0.8, 2.3]])
# Make predictions
predictions = model.predict(X_new)
predicted_labels = le.inverse_transform(predictions)
print("Predicted labels:", predicted_labels)
Here’s a step-by-step breakdown:
1. Import the necessary classes: LabelEncoder from sklearn.preprocessing for encoding the target variable, and XGBClassifier from xgboost for building the XGBoost model.
2. Initialize a LabelEncoder object. This object will learn the unique categories in the target variable and assign each category an integer value.
3. Fit the LabelEncoder on the categorical target variable y and transform it to integer-encoded values using fit_transform. This step learns the mapping from categories to integers and applies it to the target variable in one go.
4. Initialize an XGBClassifier with any desired hyperparameters. Here, we set a random_state for reproducibility.
5. Train the XGBoost model on the feature matrix X and the integer-encoded target variable y_encoded.
6. If you need to interpret the model’s predictions in terms of the original categorical labels, use LabelEncoder’s inverse_transform method. It takes the model’s integer predictions and maps them back to their corresponding categorical labels.
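To make the encoding step concrete: LabelEncoder sorts the unique labels (alphabetically for strings) and assigns each its index, which you can inspect via the fitted encoder’s classes_ attribute. A minimal sketch with the same target variable as above:

```python
from sklearn.preprocessing import LabelEncoder

y = ['cat', 'dog', 'cat', 'bird', 'dog', 'cat']

le = LabelEncoder()
y_encoded = le.fit_transform(y)

# classes_ holds the sorted unique labels; each label's code is its index
print(le.classes_)                       # ['bird' 'cat' 'dog']
print(y_encoded)                         # [1 2 1 0 2 1]

# inverse_transform maps integer codes back to the original labels
print(le.inverse_transform([0, 1, 2]))   # ['bird' 'cat' 'dog']
```

Because the mapping lives in the fitted encoder, keep the same LabelEncoder instance around (or persist it alongside the model) so predictions can be decoded later.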