The reg_alpha parameter in XGBoost controls the L1 regularization term, which adds a penalty proportional to the absolute value of the leaf weights. Tuning reg_alpha can help prevent overfitting by shrinking the weights of less important leaves toward zero, leading to a simpler and more generalizable model. This example demonstrates how to tune the reg_alpha hyperparameter using grid search with cross-validation to find a value that balances regularization strength and model performance.
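For reference, XGBoost's per-tree complexity penalty is commonly written as below, where T is the number of leaves and w_j are the leaf weights; reg_alpha corresponds to the α term, with γ and λ the companion pruning and L2 terms. This is a sketch of the standard formulation, not a quote from the library's documentation:

\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|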
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
# Create a synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Configure cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)
# Define hyperparameter grid
param_grid = {
    'reg_alpha': [0, 0.1, 0.5, 1, 5, 10]
}
# Set up XGBoost regressor
model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)
grid_search.fit(X, y)
# Get results
print(f"Best reg_alpha: {grid_search.best_params_['reg_alpha']}")
print(f"Best CV MSE: {-grid_search.best_score_:.4f}")
# Plot reg_alpha vs. CV MSE
results = grid_search.cv_results_
mean_mse = -results['mean_test_score']  # negate scores to recover MSE
std_mse = results['std_test_score']
plt.figure(figsize=(10, 6))
plt.plot(param_grid['reg_alpha'], mean_mse, marker='o', linestyle='-', color='b')
plt.fill_between(param_grid['reg_alpha'], mean_mse + std_mse, mean_mse - std_mse,
                 alpha=0.1, color='b')
plt.title('Reg Alpha vs. MSE')
plt.xlabel('Reg Alpha')
plt.ylabel('CV Average MSE')
plt.grid(True)
plt.show()
Running the example produces a line plot of reg_alpha against the cross-validated MSE, with a shaded band marking one standard deviation across folds.
In this example, we create a synthetic regression dataset using scikit-learn’s make_regression function. We then set up a KFold cross-validation object that splits the data into five training and validation folds.
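As a quick sanity check (not part of the original example), you can confirm the dataset shape and the fold sizes that KFold yields:

# Hypothetical sanity check: dataset shape and per-fold sizes
print(X.shape, y.shape)  # expected: (1000, 20) (1000,)
for fold, (train_idx, val_idx) in enumerate(cv.split(X)):
    print(f"Fold {fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows")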
We define a hyperparameter grid param_grid that specifies the range of reg_alpha values we want to test, including 0 (no L1 regularization) and several positive values.
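If the best value lands at the edge of the grid, a wider, log-spaced grid is a natural follow-up. A minimal sketch; the specific range below is illustrative, not from the original example:

import numpy as np

# Illustrative wider grid: 0 (no regularization) plus six log-spaced values
param_grid = {
    'reg_alpha': [0] + list(np.logspace(-3, 2, 6))  # 0.001, 0.01, ..., 100
}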
We create an instance of the XGBRegressor with basic hyperparameters set, such as n_estimators and learning_rate. We then perform the grid search using GridSearchCV, providing the model, parameter grid, cross-validation object, scoring metric (negative mean squared error), and the number of CPU cores to use for parallel computation (n_jobs=-1).
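To spot-check a single candidate without the grid machinery, cross_val_score gives the same cross-validated estimate; a small sketch under the same setup (the reg_alpha value here is arbitrary):

from sklearn.model_selection import cross_val_score

# Cross-validated MSE for one fixed reg_alpha value
single = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1,
                          reg_alpha=1.0, random_state=42)
scores = cross_val_score(single, X, y, cv=cv, scoring='neg_mean_squared_error')
print(f"reg_alpha=1.0 mean CV MSE: {-scores.mean():.4f}")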
After fitting the grid search object with grid_search.fit(X, y), we can access the best reg_alpha value and the corresponding best cross-validation mean squared error using grid_search.best_params_ and grid_search.best_score_, respectively.
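Because GridSearchCV refits the winning configuration on the full dataset by default (refit=True), the tuned model is immediately usable, for example:

# Refit model with the best reg_alpha, trained on all of X, y
best_model = grid_search.best_estimator_
print(best_model.predict(X[:5]))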
Finally, we plot the relationship between the reg_alpha values and the cross-validation average mean squared error using matplotlib. We retrieve the results from grid_search.cv_results_ and plot the mean MSE scores with a shaded band showing one standard deviation across folds. This visualization helps us understand how the choice of reg_alpha affects the model’s performance and guides us in selecting an appropriate value.
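The same results can also be inspected as a table; a short sketch assuming pandas is installed:

import pandas as pd

# Tabular view of the grid search results
df = pd.DataFrame(grid_search.cv_results_)
print(df[['param_reg_alpha', 'mean_test_score', 'std_test_score',
          'rank_test_score']].to_string(index=False))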
By tuning the reg_alpha hyperparameter using grid search with cross-validation, we can find the optimal level of L1 regularization that minimizes overfitting and improves the model’s generalization performance.