XGBoost requires the target variable to be numerical.
When working with categorical target variables, they must be converted to integers before training an XGBoost model.
scikit-learn’s LabelEncoder provides a simple and efficient way to perform this integer encoding.
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
import numpy as np
# Synthetic feature matrix X
X = np.array([[2.5, 1.0, 3.0],
              [5.0, 2.0, 4.0],
              [3.0, 1.5, 3.5],
              [1.0, 0.5, 2.0],
              [4.5, 1.8, 4.2],
              [2.8, 1.2, 3.2]])
# Example categorical target variable
y = ['cat', 'dog', 'cat', 'bird', 'dog', 'cat']
# Initialize LabelEncoder
le = LabelEncoder()
# Fit and transform the target variable
y_encoded = le.fit_transform(y)
# Initialize and train XGBoost model
model = XGBClassifier(random_state=42)
model.fit(X, y_encoded)
# New data for prediction
X_new = np.array([[3.2, 1.3, 3.4],
                  [1.5, 0.8, 2.3]])
# Make predictions
predictions = model.predict(X_new)
predicted_labels = le.inverse_transform(predictions)
print("Predicted labels:", predicted_labels)
Here’s a step-by-step breakdown:
1. Import the necessary classes: LabelEncoder from sklearn.preprocessing for encoding the target variable, and XGBClassifier from xgboost for building the XGBoost model.
2. Initialize a LabelEncoder object. This object will learn the unique categories in the target variable and assign each category an integer value.
3. Fit the LabelEncoder on the categorical target variable y and transform it to integer-encoded values using fit_transform. This step learns the mapping from categories to integers and applies it to the target variable in one go.
4. Initialize an XGBClassifier with any desired hyperparameters. Here, we set a random_state for reproducibility.
5. Train the XGBoost model on the feature matrix X and the integer-encoded target variable y_encoded.
6. If you need to interpret the model’s predictions in terms of the original categorical labels, use LabelEncoder’s inverse_transform method. It takes the model’s integer predictions and maps them back to their corresponding categorical labels.
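To make the encoding step concrete: LabelEncoder sorts the unique labels (alphabetically for strings) and assigns each its index, which you can inspect via the fitted encoder’s classes_ attribute. A minimal sketch with the same target variable as above:

```python
from sklearn.preprocessing import LabelEncoder

y = ['cat', 'dog', 'cat', 'bird', 'dog', 'cat']

le = LabelEncoder()
y_encoded = le.fit_transform(y)

# classes_ holds the sorted unique labels; each label's code is its index
print(le.classes_)                       # ['bird' 'cat' 'dog']
print(y_encoded)                         # [1 2 1 0 2 1]

# inverse_transform maps integer codes back to the original labels
print(le.inverse_transform([0, 1, 2]))   # ['bird' 'cat' 'dog']
```

Because the mapping lives in the fitted encoder, keep the same LabelEncoder instance around (or persist it alongside the model) so predictions can be decoded later.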