XGBoost requires all input features to be numeric.
When working with categorical features, you need to convert them to integers using a technique like label encoding.
from sklearn.preprocessing import LabelEncoder
import pandas as pd
from xgboost import XGBClassifier
# Create a synthetic dataset
data = {'color': ['red', 'blue', 'green', 'red', 'green'],
        'size': ['small', 'large', 'medium', 'medium', 'small'],
        'target': [0, 1, 1, 0, 0]}
df = pd.DataFrame(data)
# Initialize LabelEncoder
le_color = LabelEncoder()
le_size = LabelEncoder()
# Encode categorical features
df['color_encoded'] = le_color.fit_transform(df['color'])
df['size_encoded'] = le_size.fit_transform(df['size'])
# Prepare data for XGBoost
X = df[['color_encoded', 'size_encoded']].values
y = df['target'].values
# Initialize and train XGBClassifier
model = XGBClassifier(random_state=42)
model.fit(X, y)
# New data for prediction
new_data = {'color': ['blue', 'red', 'green'],
            'size': ['medium', 'small', 'large']}
new_df = pd.DataFrame(new_data)
# Encode new data
new_df['color_encoded'] = le_color.transform(new_df['color'])
new_df['size_encoded'] = le_size.transform(new_df['size'])
# Make predictions
X_new = new_df[['color_encoded', 'size_encoded']].values
predictions = model.predict(X_new)
# Print results
print("New Data:")
print(new_df)
print("\nPredictions:", predictions)
Here’s what’s happening:
- We create a synthetic dataset df with two categorical features, color and size, along with a binary target variable.
- We initialize an instance of LabelEncoder for each categorical feature.
- We use fit_transform() to encode the categorical features, creating new columns color_encoded and size_encoded in the dataframe.
- We prepare the encoded features and target variable as numpy arrays X and y for XGBoost.
- We initialize an XGBClassifier with a random_state for reproducibility and train it using fit().
- For making predictions, we create a new dataframe new_df with categorical features in the same format as the training data.
- We use the same LabelEncoder instances to transform the categorical features in new_df, ensuring consistency with the encoding used during training.
- We prepare the encoded features from new_df as a numpy array X_new and use the trained model to make predictions.
By using LabelEncoder, we convert the categorical features into integers that XGBoost can handle. It’s important to use the same LabelEncoder instances for encoding the new data to ensure the same integer mapping is used as during training.
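Note that LabelEncoder.transform() raises a ValueError if it meets a category that was not present during fitting. As a minimal follow-up sketch (not part of the example above), here is one way to inspect the integer mapping an encoder learned and to check incoming data for unseen categories before encoding; it reuses le_color from the example, and the 'purple' value is a hypothetical unseen category.

# Inspect the mapping le_color learned during fit_transform(): classes_ holds
# the sorted unique categories, and each category's index is its integer code.
color_mapping = dict(zip(le_color.classes_.tolist(), range(len(le_color.classes_))))
print(color_mapping)  # {'blue': 0, 'green': 1, 'red': 2}

# Check incoming values against the known classes before calling transform(),
# since transform() raises a ValueError for categories it has never seen.
incoming = ['red', 'purple']  # 'purple' was not seen during fit()
unseen = set(incoming) - set(le_color.classes_)
if unseen:
    print("Unseen categories:", unseen)  # decide how to handle these, e.g. drop the rows or map to a default
else:
    codes = le_color.transform(incoming)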