XGBoost requires all input features to be numeric.
When working with categorical features, you need to convert them to integers using a technique like label encoding.
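If you're curious what goes wrong when you skip the conversion, here is a quick sketch (for illustration only; the exact error message depends on your XGBoost version):

import pandas as pd
from xgboost import XGBClassifier

raw = pd.DataFrame({'color': ['red', 'blue', 'green'], 'target': [0, 1, 1]})
try:
    # Feeding a raw string column straight into XGBoost fails,
    # typically with a ValueError about non-numeric dtypes.
    XGBClassifier().fit(raw[['color']], raw['target'])
except Exception as err:
    print("XGBoost rejected the string feature:", err)

With that in mind, here is the full label-encoding workflow: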
from sklearn.preprocessing import LabelEncoder
import pandas as pd
from xgboost import XGBClassifier
# Create a synthetic dataset
data = {'color': ['red', 'blue', 'green', 'red', 'green'],
        'size': ['small', 'large', 'medium', 'medium', 'small'],
        'target': [0, 1, 1, 0, 0]}
df = pd.DataFrame(data)
# Initialize LabelEncoder
le_color = LabelEncoder()
le_size = LabelEncoder()
# Encode categorical features
df['color_encoded'] = le_color.fit_transform(df['color'])
df['size_encoded'] = le_size.fit_transform(df['size'])
# Prepare data for XGBoost
X = df[['color_encoded', 'size_encoded']].values
y = df['target'].values
# Initialize and train XGBClassifier
model = XGBClassifier(random_state=42)
model.fit(X, y)
# New data for prediction
new_data = {'color': ['blue', 'red', 'green'],
            'size': ['medium', 'small', 'large']}
new_df = pd.DataFrame(new_data)
# Encode new data
new_df['color_encoded'] = le_color.transform(new_df['color'])
new_df['size_encoded'] = le_size.transform(new_df['size'])
# Make predictions
X_new = new_df[['color_encoded', 'size_encoded']].values
predictions = model.predict(X_new)
# Print results
print("New Data:")
print(new_df)
print("\nPredictions:", predictions)
Here’s what’s happening in the main example, step by step:
- We create a synthetic dataset df with two categorical features, color and size, along with a binary target variable.
- We initialize a LabelEncoder instance for each categorical feature.
- We use fit_transform() to encode the categorical features, creating new columns color_encoded and size_encoded in the dataframe.
- We prepare the encoded features and target variable as numpy arrays X and y for XGBoost.
- We initialize an XGBClassifier with a random_state for reproducibility and train it using fit().
- For making predictions, we create a new dataframe new_df with categorical features in the same format as the training data.
- We use the same LabelEncoder instances to transform the categorical features in new_df, ensuring consistency with the encoding used during training (see the sketch after this list for handling categories the encoders have never seen).
- We prepare the encoded features from new_df as a numpy array X_new and use the trained model to make predictions.
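One practical caveat: LabelEncoder.transform() raises a ValueError when it meets a category that was not present during fit. Below is a minimal sketch of one way to guard against that; the safe_transform helper and the -1 sentinel are illustrative choices rather than part of the original example, and since the model never saw that sentinel during training, predictions for such rows should be treated with caution.

import numpy as np

def safe_transform(encoder, values, unknown_code=-1):
    # Map values the encoder has already seen to their learned codes,
    # and anything unseen to an arbitrary sentinel (here -1).
    known = set(encoder.classes_)
    return np.array([encoder.transform([v])[0] if v in known else unknown_code
                     for v in values])

print(safe_transform(le_color, ['blue', 'purple']))  # 'purple' was never seen -> -1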
By using LabelEncoder, we convert the categorical features into integers that XGBoost can handle. It’s important to use the same LabelEncoder instances for encoding the new data to ensure the same integer mapping is used as during training.
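In a real project, training and prediction usually happen in separate scripts, so you would typically persist the fitted encoders alongside the model to guarantee the same mapping is reused. Here is a small sketch using joblib (the file names are arbitrary examples):

import joblib

# Save the fitted encoders and the trained model together
joblib.dump({'color': le_color, 'size': le_size}, 'encoders.joblib')
model.save_model('xgb_model.json')

# Later, in a separate prediction script
encoders = joblib.load('encoders.joblib')
restored = XGBClassifier()
restored.load_model('xgb_model.json')
codes = encoders['color'].transform(['red'])  # same integer mapping as at training time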