The alpha parameter in XGBoost controls the L1 regularization term on the leaf weights, which can help with feature selection and prevent overfitting by encouraging sparse solutions. Higher values of alpha apply stronger regularization and can shrink the weights associated with less informative features to exactly zero. However, setting alpha too high may lead to underfitting.
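To make the trade-off concrete, here is a minimal sketch comparing held-out R^2 for an unregularized model against a heavily regularized one. The dataset and the alpha values 0 and 1000 are arbitrary choices for illustration, not tuned values:
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
# Synthetic data with a held-out test set
X, y = make_regression(n_samples=1000, n_features=20, n_informative=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# alpha values chosen only to illustrate the trade-off, not tuned
for alpha in [0, 1000]:
    model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, reg_alpha=alpha, random_state=42)
    model.fit(X_train, y_train)
    print(f"alpha={alpha}: held-out R^2 = {model.score(X_test, y_test):.4f}")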
An alias for the alpha parameter is reg_alpha, which is the documented parameter name in the scikit-learn wrapper.
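A short sketch of the two spellings, using the scikit-learn wrapper on one hand and the native training API on the other; both set the same L1 penalty:
import numpy as np
import xgboost as xgb
# scikit-learn wrapper: the documented parameter name is reg_alpha
sk_model = xgb.XGBRegressor(reg_alpha=0.1)
# native training API: the same L1 penalty is passed as 'alpha'
dtrain = xgb.DMatrix(np.random.rand(20, 3), label=np.random.rand(20))
params = {'objective': 'reg:squarederror', 'alpha': 0.1}
booster = xgb.train(params, dtrain, num_boost_round=5)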
This example demonstrates how to tune the alpha hyperparameter using grid search with cross-validation to find the optimal value that balances regularization and performance.
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
# Create a synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=10, noise=0.1, random_state=42)
# Configure cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)
# Define hyperparameter grid
param_grid = {
'alpha': [0, 0.01, 0.1, 1, 10, 100]
}
# Set up XGBoost regressor
model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv, scoring='r2', n_jobs=-1, verbose=1)
grid_search.fit(X, y)
# Get results
print(f"Best alpha: {grid_search.best_params_['alpha']}")
print(f"Best CV R^2 score: {grid_search.best_score_:.4f}")
# Plot alpha vs. R^2 score
results = grid_search.cv_results_
plt.figure(figsize=(10, 6))
# Note: the alpha=0 point cannot be drawn on a log-scaled x-axis and is silently dropped
plt.semilogx(param_grid['alpha'], results['mean_test_score'], marker='o', linestyle='-', color='b')
plt.fill_between(param_grid['alpha'],
                 results['mean_test_score'] - results['std_test_score'],
                 results['mean_test_score'] + results['std_test_score'],
                 alpha=0.1, color='b')
plt.title('Alpha vs. R^2 Score')
plt.xlabel('Alpha (log scale)')
plt.ylabel('CV Average R^2 Score')
plt.grid(True)
plt.show()
The resulting plot shows the mean cross-validation R^2 score at each alpha value, with a shaded band for the standard deviation across folds.
In this example, we create a synthetic regression dataset using scikit-learn's make_regression function. We then set up a KFold cross-validation object that splits the data into five training/validation folds, as sketched below.
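As a small sketch of what that configuration produces, each of the five folds holds out one fifth of the 1000 rows for validation:
# Inspect the five train/validation splits produced by the KFold above
for i, (train_idx, val_idx) in enumerate(cv.split(X)):
    print(f"Fold {i}: {len(train_idx)} train rows, {len(val_idx)} validation rows")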
We define a hyperparameter grid param_grid that specifies the range of alpha values we want to test, including 0 (no regularization) and increasing orders of magnitude up to 100.
We create an instance of the XGBRegressor with some basic hyperparameters set. We then perform the grid search using GridSearchCV, providing the model, parameter grid, cross-validation object, scoring metric (R^2), and n_jobs=-1 to use all available CPU cores for parallel computation.
After fitting the grid search object, we can access the best alpha value and the corresponding best cross-validation R^2 score.
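Since GridSearchCV refits the winning configuration on the full dataset by default (refit=True), the tuned model can be used directly; a brief usage sketch:
# GridSearchCV with refit=True (the default) retrains the best model on all of X, y
best_model = grid_search.best_estimator_
print(best_model.predict(X[:5]))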
Finally, we plot the relationship between the alpha values and the cross-validation average R^2 scores using matplotlib. We use a logarithmic scale for the x-axis since the alpha values span several orders of magnitude (note that the alpha=0 point cannot appear on a log axis). The shaded band around the curve shows one standard deviation of the scores across folds.
The optimal alpha value depends on the specific dataset and problem at hand. Setting alpha too high can lead to underfitting, while setting it too low may not provide enough regularization to prevent overfitting. It's recommended to use grid search with cross-validation to find a good balance for each use case.
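As a follow-up sketch with hypothetical values, one common refinement is to search a finer grid around the best coarse-grid alpha:
# Hypothetical refinement: suppose the coarse grid selected alpha=1;
# a finer grid around that value narrows down the optimum
fine_grid = {'alpha': [0.3, 0.5, 1, 2, 3, 5]}
fine_search = GridSearchCV(estimator=model, param_grid=fine_grid, cv=cv, scoring='r2', n_jobs=-1)
fine_search.fit(X, y)
print(f"Refined best alpha: {fine_search.best_params_['alpha']}")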