XGBoost requires the target variable to be numerical. A categorical target variable must therefore be converted to integers before training an XGBoost model. scikit-learn’s LabelEncoder provides a simple and efficient way to perform this integer encoding.
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
import numpy as np
# Synthetic feature matrix X
X = np.array([[2.5, 1.0, 3.0],
              [5.0, 2.0, 4.0],
              [3.0, 1.5, 3.5],
              [1.0, 0.5, 2.0],
              [4.5, 1.8, 4.2],
              [2.8, 1.2, 3.2]])
# Example categorical target variable
y = ['cat', 'dog', 'cat', 'bird', 'dog', 'cat']
# Initialize LabelEncoder
le = LabelEncoder()
# Fit and transform the target variable
y_encoded = le.fit_transform(y)
# Initialize and train XGBoost model
model = XGBClassifier(random_state=42)
model.fit(X, y_encoded)
# New data for prediction
X_new = np.array([[3.2, 1.3, 3.4],
                  [1.5, 0.8, 2.3]])
# Make predictions
predictions = model.predict(X_new)
predicted_labels = le.inverse_transform(predictions)
print("Predicted labels:", predicted_labels)
Here’s a step-by-step breakdown:
1. Import the necessary classes: LabelEncoder from sklearn.preprocessing for encoding the target variable, and XGBClassifier from xgboost for building the XGBoost model.
2. Initialize a LabelEncoder object. This object will learn the unique categories in the target variable and assign each category an integer value.
3. Fit the LabelEncoder on the categorical target variable y and transform it to integer-encoded values using fit_transform. This step learns the mapping from categories to integers and applies it to the target variable in one go.
4. Initialize an XGBClassifier with any desired hyperparameters. Here, we set a random_state for reproducibility.
5. Train the XGBoost model using the integer-encoded target variable y_encoded and the feature matrix X.
6. If you need to interpret the model’s predictions in terms of the original categorical labels, use LabelEncoder’s inverse_transform method. This method takes the model’s integer predictions and maps them back to their corresponding categorical labels.
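To see which integer each category received in steps 2 and 3, you can inspect the fitted encoder's classes_ attribute. The following is a minimal sketch that reuses the same y as above. The names mapping, example_predictions, and decoded are illustrative only, and example_predictions is a hand-written stand-in for the output of model.predict, not real model output.

```python
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Same categorical target as in the example above
y = ['cat', 'dog', 'cat', 'bird', 'dog', 'cat']

le = LabelEncoder()
y_encoded = le.fit_transform(y)

# classes_ stores the unique labels in sorted order;
# each label's position in that array is its integer code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print("Mapping:", mapping)                    # {'bird': 0, 'cat': 1, 'dog': 2}
print("Encoded target:", y_encoded.tolist())  # [1, 2, 1, 0, 2, 1]

# Hand-written integers standing in for model.predict(...) output (assumption)
example_predictions = np.array([1, 0, 2])
decoded = le.inverse_transform(example_predictions)
print("Decoded labels:", decoded.tolist())    # ['cat', 'bird', 'dog']
```

Because the codes follow the sorted order of the labels, they run from 0 to the number of classes minus one, which is the kind of integer target the classifier in the example is trained on.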