Ordinal Encode Categorical Features for XGBoost

By default, XGBoost expects numerical input features.

Ordered categorical (ordinal) features must therefore be converted to numbers before training an XGBoost model. Ordinal encoding maps each unique category to an integer, ideally in a way that preserves the order of the categories when one exists.

scikit-learn’s OrdinalEncoder provides a simple way to perform this encoding. By default it assigns codes in sorted (alphabetical) order of the category values, so pass the desired order through its categories parameter when the natural order differs. Combined with ColumnTransformer, the encoder can be applied to only the categorical columns of a dataset that mixes categorical and numerical features, which also makes it straightforward to include the encoding step in a machine learning pipeline.
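
To illustrate how the category order affects the resulting codes, here is a minimal sketch comparing OrdinalEncoder with its default settings against one given an explicit category list. The `levels` array and the variable names are made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature whose natural order is low < medium < high
levels = np.array([['low'], ['medium'], ['high']], dtype=object)

# Default settings: categories are sorted, so the codes follow
# alphabetical order (high=0, low=1, medium=2)
default_codes = OrdinalEncoder().fit_transform(levels).ravel()

# Explicit category list: the codes follow the listed order
# (low=0, medium=1, high=2)
ordered_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
ordered_codes = ordered_encoder.fit_transform(levels).ravel()

print("Default codes:", default_codes)
print("Ordered codes:", ordered_codes)
```

With the default settings, 'high' receives the smallest code, which is usually not what is wanted for a feature of this kind.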

from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from xgboost import XGBRegressor
import numpy as np

# Synthetic feature matrix X with categorical and numerical features
X = np.array([['low', 10.0],
              ['medium', 20.0],
              ['high', 30.0],
              ['low', 15.0],
              ['high', 25.0],
              ['medium', 18.0]], dtype=object)
# Explicitly converting the second column to floats
X[:, 1] = X[:, 1].astype(float)


# Example target variable
y = [1.5, 2.3, 3.1, 1.8, 2.9, 2.2]

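# Create ColumnTransformer to apply OrdinalEncoder to the first column.
# The categories are listed explicitly so that low < medium < high;
# without this, OrdinalEncoder would sort them alphabetically.
transformer = ColumnTransformer(transformers=[
    ('ordinal', OrdinalEncoder(categories=[['low', 'medium', 'high']]), [0])],
    remainder='passthrough')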

# Perform ordinal encoding
X = transformer.fit_transform(X)

# Define the xgboost model configuration
model = XGBRegressor(random_state=42)

# Fit the model on the encoded features
model.fit(X, y)

# New data for prediction
X_new = np.array([['medium', 22],
                  ['low', 12]], dtype=object)
X_new[:, 1] = X_new[:, 1].astype(float)

# Perform ordinal encoding
X_new = transformer.transform(X_new)

# Make predictions using the trained model
predictions = model.predict(X_new)

print("Predictions:", predictions)

Here’s a step-by-step breakdown:

Import the necessary classes: OrdinalEncoder for encoding the categorical features, ColumnTransformer for applying transformations to specific columns, and XGBRegressor for the model.
Create a ColumnTransformer named transformer. This transformer applies OrdinalEncoder to the first column (index 0) of the input data and passes the second column (index 1) through unchanged.
Fit the transformer and apply it to the dataset with fit_transform, which replaces the first column with numeric codes (stored as floats by default).
Fit the model using the transformed input features X and target variable y.
To make predictions on new data X_new, first pass it to the fitted ColumnTransformer’s transform method, which applies the mapping learned during fitting, then pass the encoded result to the trained model’s predict method.
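
To check which mapping the fitted transformer applies, one option is to inspect the encoder stored inside it. The sketch below assumes an explicit category order and uses made-up names (`X_demo`, `demo_transformer`); it relies on the fitted ColumnTransformer's `named_transformers_` attribute and the encoder's `categories_` attribute.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: one ordinal column and one numerical column
X_demo = np.array([['low', 10.0],
                   ['medium', 20.0],
                   ['high', 30.0]], dtype=object)

demo_transformer = ColumnTransformer(
    transformers=[
        ('ordinal',
         OrdinalEncoder(categories=[['low', 'medium', 'high']]),
         [0]),
    ],
    remainder='passthrough')

# Fit on the data and encode it in one step
X_demo_encoded = demo_transformer.fit_transform(X_demo)

# Look up the fitted encoder by the name it was given ('ordinal');
# the position of each category in the list is the code it receives
fitted_encoder = demo_transformer.named_transformers_['ordinal']
learned_categories = [list(c) for c in fitted_encoder.categories_]

# Reuse the fitted transformer on new rows without refitting
X_demo_new = np.array([['medium', 22.0]], dtype=object)
X_demo_new_encoded = demo_transformer.transform(X_demo_new)

print("Learned categories:", learned_categories)
print("Encoded training data:\n", X_demo_encoded)
print("Encoded new data:", X_demo_new_encoded)
```

The encoded column appears first in the output because it is the only explicitly listed transformer; passthrough columns are appended after it.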

By incorporating OrdinalEncoder into a ColumnTransformer, you can streamline the process of preparing your data for XGBoost, especially when dealing with datasets containing both categorical and numerical features. Because the same fitted transformer is used during training and when making predictions on new data, the category-to-integer mapping stays consistent between the two.

See Also