
XGBoost Min-Max Scaling Numerical Input Features

XGBoost is known for its robustness and ability to handle a wide range of input features without extensive preprocessing.

Nevertheless, we may want to scale numerical input variables that have large scales or significantly different ranges.

In this example, we’ll generate a synthetic dataset with large input values and compare the performance of two XGBoost models: one trained on the original dataset and another trained on the dataset with min-max scaled features in the range of -1 to 1.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score
import xgboost as xgb

# Generate a synthetic dataset with large input values
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8,
                           n_redundant=2, n_clusters_per_class=1,
                           class_sep=2.0, scale=1000, random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit min-max scaling on the training data and apply the same transform to the test set
scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train an XGBoost model on the original dataset
model_original = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_original.fit(X_train, y_train)

# Train an XGBoost model on the min-max scaled dataset
model_scaled = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_scaled.fit(X_train_scaled, y_train)

# Evaluate the models' performance
y_pred_original = model_original.predict(X_test)
y_pred_scaled = model_scaled.predict(X_test_scaled)

f1_original = f1_score(y_test, y_pred_original)
f1_scaled = f1_score(y_test, y_pred_scaled)

print(f"F1-score (Original): {f1_original:.4f}")
print(f"F1-score (Min-Max Scaled): {f1_scaled:.4f}")

The code above generates a synthetic dataset using make_classification from scikit-learn with 10 features, 8 of which are informative. The scale parameter is set to 1000 to create features with large input values.
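To confirm that the generated features really do have large values, you can inspect their ranges directly. This is a quick, optional check (not part of the original example) that reuses the X array from the snippet above:

# Inspect the raw feature ranges produced by scale=1000
print(f"Overall feature range: [{X.min():.1f}, {X.max():.1f}]")
print("Per-feature min/max:", list(zip(X.min(axis=0).round(1), X.max(axis=0).round(1))))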

The data is then split into train and test sets. A MinMaxScaler from scikit-learn with a feature range of -1 to 1 is fit on the training data only and used to transform both the training and test sets, so no information from the test set leaks into the scaling.
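Under the hood, MinMaxScaler applies a simple linear rescaling: each feature is shifted to [0, 1] and then stretched to the requested range. As a sanity check, the transform can be reproduced by hand with NumPy; this is a minimal sketch that reuses X_train and X_train_scaled from the example above:

import numpy as np

# Reproduce MinMaxScaler's transform manually for feature_range=(-1, 1)
data_min = X_train.min(axis=0)
data_max = X_train.max(axis=0)
X_std = (X_train - data_min) / (data_max - data_min)  # rescale to [0, 1]
X_manual = X_std * 2 - 1                              # stretch [0, 1] to [-1, 1]

# Should match the scaler's output up to floating-point error
assert np.allclose(X_manual, X_train_scaled)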

Two XGBoost classifiers with the same hyperparameters are trained: one on the original dataset and another on the min-max scaled dataset. The performance of both models is evaluated using the F1-score metric.
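If you adopt scaling in practice, bundling the scaler and the model in a scikit-learn Pipeline keeps the fit/transform bookkeeping automatic and prevents accidental test-set leakage. A minimal sketch of that alternative setup, reusing the data and imports from the example above:

from sklearn.pipeline import make_pipeline

# Chain scaling and the classifier; the scaler is fit only on the training data
pipeline = make_pipeline(MinMaxScaler(feature_range=(-1, 1)),
                         xgb.XGBClassifier(n_estimators=100, learning_rate=0.1,
                                           random_state=42))
pipeline.fit(X_train, y_train)
y_pred_pipe = pipeline.predict(X_test)
print(f"F1-score (Pipeline): {f1_score(y_test, y_pred_pipe):.4f}")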

Finally, the F1-scores of both models are printed for comparison.

By comparing the performance of the two models, you can assess the impact of min-max scaling on XGBoost’s performance when dealing with large input values. The difference in performance may vary depending on the specific dataset and problem at hand, but this example serves as a starting point for experimentation and analysis.
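Because a single train/test split can be noisy, one way to make the comparison more robust is to repeat it across several random seeds and average the scores. The following sketch, reusing X, y, and the imports from the example above, does exactly that:

import numpy as np

# Repeat the comparison over several splits to average out split-to-split noise
scores_orig, scores_scaled = [], []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    scaler = MinMaxScaler(feature_range=(-1, 1))
    X_tr_s, X_te_s = scaler.fit_transform(X_tr), scaler.transform(X_te)

    m_orig = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=seed)
    m_scaled = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=seed)
    scores_orig.append(f1_score(y_te, m_orig.fit(X_tr, y_tr).predict(X_te)))
    scores_scaled.append(f1_score(y_te, m_scaled.fit(X_tr_s, y_tr).predict(X_te_s)))

print(f"Mean F1 (Original): {np.mean(scores_orig):.4f}")
print(f"Mean F1 (Min-Max Scaled): {np.mean(scores_scaled):.4f}")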


