XGBoost Robust to Correlated Input Features (multi-collinearity)

XGBoost is remarkably resilient to correlated input features (multi-collinearity), maintaining strong predictive performance even when faced with this common data challenge.

Unlike many machine learning algorithms that can struggle or produce unstable results when input variables are highly correlated (linear models are the classic example, since collinear columns make their coefficient estimates ill-conditioned), XGBoost's tree-based ensemble approach handles such data with little degradation: each tree split uses a single feature, so redundant features simply compete for splits rather than destabilizing the model. The example below trains an XGBoost regressor on a synthetic dataset with deliberately correlated features:

from sklearn.datasets import make_regression
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Generate a synthetic dataset with correlated features
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
                       n_targets=1, noise=0.1, shuffle=True,
                       random_state=42, effective_rank=2)

# Initialize and train an XGBoost model
model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X, y)

# Evaluate the model's performance
y_pred = model.predict(X)
mse = mean_squared_error(y, y_pred)
print(f"Mean Squared Error: {mse:.4f}")

In this example:

  1. We generate a synthetic dataset using scikit-learn’s make_regression function, specifying effective_rank=2 so the features are approximately linear combinations of two latent factors and therefore strongly correlated (a quick check of this appears after the list).

  2. We initialize an XGBRegressor with basic hyperparameters and train it on the correlated dataset.

  3. We evaluate the model’s performance by making predictions on the training data and calculating the mean squared error.
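To make the correlation explicit, we can inspect the pairwise correlation matrix of the generated features. This is a minimal sketch assuming the X generated in the example above; numpy is the only extra dependency.

import numpy as np

# Pairwise correlations among the 10 generated features. With
# effective_rank=2, the features are driven by roughly two latent
# factors, so off-diagonal correlations are well above what
# independent features would show.
corr = np.corrcoef(X, rowvar=False)
off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
print(f"Max absolute off-diagonal correlation: {np.abs(off_diagonal).max():.3f}")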

The low mean squared error shows that XGBoost fits the correlated data without difficulty (note that this is training error, so it measures fit rather than generalization). This robustness to multi-collinearity is a practical advantage, since real-world input features often exhibit some degree of correlation.

A direct comparison to another model's performance on the same dataset illustrates XGBoost's strength here more concretely; a sketch of such a comparison follows. Either way, the example shows that XGBoost can handle correlated data without extensive feature engineering or preprocessing such as decorrelating or dropping redundant columns.
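As a rough illustration, the sketch below fits an ordinary linear regression and an XGBoost model on a held-out split of the same data and prints both test errors. This is an illustrative addition, not part of the original example: it assumes the X and y generated above, and the relative errors will depend on the dataset (a purely linear target can favor the linear model), but the linear model's coefficients make the effect of collinearity easy to observe.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hold out a test set so the comparison reflects generalization
# rather than training fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

linear = LinearRegression().fit(X_train, y_train)
xgb = XGBRegressor(n_estimators=100, learning_rate=0.1,
                   random_state=42).fit(X_train, y_train)

print(f"Linear regression test MSE: {mean_squared_error(y_test, linear.predict(X_test)):.4f}")
print(f"XGBoost test MSE: {mean_squared_error(y_test, xgb.predict(X_test)):.4f}")

# Under strong collinearity, linear coefficient estimates can be large
# and unstable even when the prediction error looks reasonable.
print("Linear coefficients:", linear.coef_.round(2))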



See Also