XGBoost Standardize Numerical Input Features


XGBoost is known for its robustness and ability to handle a wide range of input features without extensive preprocessing.

Nevertheless, we may want to scale numerical input variables that have large scales or significantly different ranges.

Standardizing numerical input variables involves transforming each variable so that it has a mean of zero and a standard deviation of one. This puts the variables on a comparable scale and can improve the performance of many machine learning algorithms, particularly those that are sensitive to feature magnitudes.
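As a quick check of that definition, the sketch below standardizes a small array by hand and compares the result with scikit-learn's StandardScaler. The values in X_demo are made up for illustration and are not part of the example dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only: two features on very different scales
X_demo = np.array([[1000.0, 0.1],
                   [2000.0, 0.2],
                   [3000.0, 0.3],
                   [4000.0, 0.4]])

# Manual standardization: subtract the column mean, then divide by the
# column standard deviation (population form, ddof=0, as StandardScaler does)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same transformation using scikit-learn
X_sklearn = StandardScaler().fit_transform(X_demo)

print("Manual and StandardScaler results match:", np.allclose(X_manual, X_sklearn))
print("Column means:", X_sklearn.mean(axis=0))
print("Column standard deviations:", X_sklearn.std(axis=0))
```

After the transformation, both columns sit on the same scale even though the raw values differ by several orders of magnitude.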

In this example, we’ll demonstrate the effect of standardizing numerical features on XGBoost’s performance using a synthetic dataset with large input values.

We’ll compare the performance of two XGBoost models: one trained on the original dataset and another trained on the dataset with standardized features.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import xgboost as xgb

# Generate a synthetic dataset with large input values
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8,
                           n_redundant=2, n_clusters_per_class=1,
                           class_sep=2.0, scale=1000, random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train an XGBoost model on the original dataset
model_original = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_original.fit(X_train, y_train)

# Train an XGBoost model on the standardized dataset
model_scaled = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_scaled.fit(X_train_scaled, y_train)

# Evaluate the models' performance
y_pred_original = model_original.predict(X_test)
y_pred_scaled = model_scaled.predict(X_test_scaled)

accuracy_original = accuracy_score(y_test, y_pred_original)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

print(f"Accuracy (Original): {accuracy_original:.4f}")
print(f"Accuracy (Scaled): {accuracy_scaled:.4f}")

In this example, we use make_classification from scikit-learn to generate a synthetic dataset with 10 features, 8 of which are informative. We set the scale parameter to 1000 to create features with large input values.

We then split the data into train and test sets and create two versions of the dataset: one with the original values and another with standardized values using StandardScaler from scikit-learn.

Next, we train two XGBoost classifiers with the same hyperparameters: one on the original dataset and another on the standardized dataset. We evaluate the performance of both models using accuracy as the evaluation metric.

Finally, we print the accuracies of both models to compare their performance.

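In most cases, the difference in performance between the two models is negligible, because tree-based models split on the ordering of feature values and standardization preserves that ordering. This demonstrates XGBoost’s robustness to input feature scales.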

It’s worth noting that the impact of feature scaling on XGBoost’s performance can vary depending on the specific dataset and problem at hand. As a best practice, it’s recommended to experiment with both original and standardized features to determine which approach yields the best results for your particular use case.

See Also