When working with machine learning models, it’s common to preprocess numerical input features to improve the model’s performance.
XGBoost builds its trees from threshold splits rather than distances, so it is largely insensitive to the scale and shape of input feature distributions; applying a power transform to skewed or non-normal inputs typically does not change its predictive power. Nevertheless, we may still want to power transform input variables for other reasons, for example to share a single preprocessing pipeline with models that are sensitive to feature distributions.
Power transforming numerical input variables means applying a mathematical transformation, such as the Box-Cox or Yeo-Johnson method, so that the data more closely resembles a normal distribution. This stabilizes variance and can improve the performance of algorithms that are sensitive to feature distributions.
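As a quick illustration (a standalone sketch; the exponential feature and the skewness check are ours, not part of the example below), the Yeo-Johnson transform pulls a heavily right-skewed feature toward a roughly symmetric, Gaussian-like shape:
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer
# A heavily right-skewed feature (exponentially distributed)
rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=(1000, 1))
# Yeo-Johnson transform, standardized to zero mean and unit variance
pt = PowerTransformer(method='yeo-johnson', standardize=True)
x_t = pt.fit_transform(x)
print(f"Skewness before: {skew(x.ravel()):.2f}")   # large positive value
print(f"Skewness after:  {skew(x_t.ravel()):.2f}")  # close to zero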
In this example, we’ll demonstrate the effect of applying a power transform to numerical input features on XGBoost’s performance using a synthetic dataset with skewed input values.
We’ll compare the performance of two XGBoost models: one trained on the original dataset and another trained on the dataset with power-transformed features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import f1_score
import xgboost as xgb
import numpy as np
# Generate a synthetic dataset with skewed input features
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8,
n_redundant=2, n_clusters_per_class=1,
class_sep=2.0, random_state=42)
X = np.exp(X) # Apply exponential function to create skewed features
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a power transform on the training data and apply it to the train and test sets
pt = PowerTransformer(method='yeo-johnson', standardize=True)
X_train_transformed = pt.fit_transform(X_train)
X_test_transformed = pt.transform(X_test)
# Train an XGBoost model on the original dataset
model_original = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_original.fit(X_train, y_train)
# Train an XGBoost model on the power-transformed dataset
model_transformed = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_transformed.fit(X_train_transformed, y_train)
# Evaluate the models' performance
y_pred_original = model_original.predict(X_test)
y_pred_transformed = model_transformed.predict(X_test_transformed)
f1_original = f1_score(y_test, y_pred_original)
f1_transformed = f1_score(y_test, y_pred_transformed)
print(f"F1-score (Original): {f1_original:.4f}")
print(f"F1-score (Power Transformed): {f1_transformed:.4f}")
In this example, we use make_classification from scikit-learn to generate a synthetic dataset with 10 features, 8 of which are informative. We then apply an exponential function to the generated features to create skewed distributions.
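To confirm that the exponential step really skews the inputs, you can compare per-feature skewness before and after it (a sketch intended to be appended to the script above; it re-generates the raw features and uses scipy.stats.skew, which the example does not import):
from scipy.stats import skew
# Re-generate the raw (unskewed) features with the same settings
X_raw, _ = make_classification(n_samples=1000, n_features=10, n_informative=8,
                               n_redundant=2, n_clusters_per_class=1,
                               class_sep=2.0, random_state=42)
# Mean absolute skewness across the 10 features, before and after np.exp
print(f"Mean |skew| before exp: {np.abs(skew(X_raw, axis=0)).mean():.2f}")
print(f"Mean |skew| after exp:  {np.abs(skew(np.exp(X_raw), axis=0)).mean():.2f}")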
We split the data into train and test sets and create two versions of the features: the original skewed features and a power-transformed version produced with the Yeo-Johnson method of scikit-learn's PowerTransformer, which applies a power transform to make the data more Gaussian-like. The transformer is fit on the training data only and then applied to the test data.
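Because standardize=True, the transformed training features come out with roughly zero mean and unit variance, and the fitted transform exposes one lambda per feature. A quick sanity check (a sketch that continues the script above, reusing pt and X_train_transformed):
# Inspect the fitted per-feature Yeo-Johnson lambdas
print("Fitted lambdas:", pt.lambdas_.round(2))
# Transformed training features should be roughly standard normal per column
print("Feature means (~0):", X_train_transformed.mean(axis=0).round(2))
print("Feature stds  (~1):", X_train_transformed.std(axis=0).round(2))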
Next, we train two XGBoost classifiers with the same hyperparameters: one on the original dataset and another on the power-transformed dataset. We evaluate the performance of both models using the F1-score.
Finally, we print the F1-scores of both models to compare their performance.
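An equivalent, arguably tidier way to express the transformed workflow is to bundle the transform and the model in a scikit-learn Pipeline, which guarantees the transform is fit on training data only (a sketch that continues the script above; the pipeline variable name is ours):
from sklearn.pipeline import make_pipeline
# Bundle the power transform and the classifier; fit() fits the transform on the
# training data only, and predict() applies it to new data automatically
pipeline = make_pipeline(
    PowerTransformer(method='yeo-johnson', standardize=True),
    xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42),
)
pipeline.fit(X_train, y_train)
print(f"F1-score (Pipeline): {f1_score(y_test, pipeline.predict(X_test)):.4f}")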
Applying a power transform to input features is often not beneficial with XGBoost.
However, the impact of power transforms on XGBoost’s performance can vary depending on the specific dataset and problem at hand. It’s essential to experiment with different data preprocessing techniques, including power transforms, to determine which approach yields the best results for your particular use case.
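One way to run that experiment is to loop over a few candidate transformers and compare cross-validated F1-scores (a sketch that continues the script above; the candidate list and CV settings are illustrative choices, not recommendations):
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, QuantileTransformer
# Candidate preprocessing steps to compare (None means no preprocessing)
candidates = {
    'none': None,
    'power (yeo-johnson)': PowerTransformer(method='yeo-johnson'),
    'standard scaling': StandardScaler(),
    'quantile (normal)': QuantileTransformer(output_distribution='normal',
                                             n_quantiles=100, random_state=42),
}
for name, step in candidates.items():
    clf = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
    estimator = clf if step is None else make_pipeline(step, clf)
    score = cross_val_score(estimator, X, y, cv=5, scoring='f1').mean()
    print(f"{name:>20}: CV F1 = {score:.4f}")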