The `reg_lambda` parameter in XGBoost controls the L2 regularization term, which helps prevent overfitting by adding a penalty to the model’s objective function based on the sum of squared leaf weights.
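
Concretely, `reg_lambda` is the λ in XGBoost’s regularized objective, as given in the XGBoost documentation. For an ensemble of trees f_k, each with T leaves and leaf weights w_j (l is the training loss and γ is the separate `gamma` parameter):

```
\text{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
```

Larger values of λ shrink the leaf weights toward zero, producing a smoother, more conservative model.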

By tuning `reg_lambda`, you can find the optimal balance between model complexity and generalization performance.

This example demonstrates how to tune the `reg_lambda` hyperparameter using grid search with cross-validation to find the best value that minimizes overfitting and improves the model’s ability to generalize to unseen data.

```
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold

# Create a synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Configure 5-fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Define the hyperparameter grid: a mix of small and large reg_lambda values
param_grid = {
    'reg_lambda': [0, 0.1, 0.5, 1, 5, 10, 50, 100]
}

# Set up the XGBoost regressor with basic hyperparameters
model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Perform the grid search over reg_lambda
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=cv,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1,
)
grid_search.fit(X, y)

# Report the best value and its cross-validated MSE
print(f"Best reg_lambda: {grid_search.best_params_['reg_lambda']}")
print(f"Best CV MSE: {-grid_search.best_score_:.4f}")

# Plot reg_lambda vs. CV MSE (note: the reg_lambda=0 point is
# masked by matplotlib on a log-scaled x-axis)
results = grid_search.cv_results_
mean_mse = -results['mean_test_score']
std_mse = results['std_test_score']

plt.figure(figsize=(10, 6))
plt.semilogx(param_grid['reg_lambda'], mean_mse, marker='o', linestyle='-', color='b')
plt.fill_between(param_grid['reg_lambda'], mean_mse - std_mse, mean_mse + std_mse,
                 alpha=0.1, color='b')
plt.title('Reg Lambda vs. MSE')
plt.xlabel('Reg Lambda (log scale)')
plt.ylabel('CV Average MSE')
plt.grid(True)
plt.show()
```

The resulting plot shows how the cross-validated MSE changes across the `reg_lambda` values.

In this example, we create a synthetic regression dataset using scikit-learn’s `make_regression` function. We then set up a `KFold` cross-validation object to split the data into multiple folds for evaluation.
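
If you want a quick single-value check before committing to the full grid, the same folds can be reused with `cross_val_score`. This is a minimal sketch that borrows `X`, `y`, and `cv` from the example above; the `reg_lambda=1` candidate is an arbitrary choice:

```
from sklearn.model_selection import cross_val_score

# Score one candidate value with the same folds as the grid search
baseline = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1,
                            reg_lambda=1, random_state=42)
scores = cross_val_score(baseline, X, y, cv=cv,
                         scoring='neg_mean_squared_error', n_jobs=-1)
print(f"CV MSE at reg_lambda=1: {-scores.mean():.4f} (+/- {scores.std():.4f})")
```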

We define a hyperparameter grid `param_grid` that specifies the range of `reg_lambda` values we want to test. In this case, we consider values from 0 to 100, with a mix of small and large values to cover a wide range.
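
If the coarse grid points toward a promising region, a second, finer pass can zoom in. One way to build such a grid is with `np.logspace`; the 1–10 window below is purely illustrative:

```
import numpy as np

# Log-spaced refinement grid between 1 and 10; substitute whatever
# region the first, coarser search suggests
fine_grid = {'reg_lambda': np.logspace(0, 1, num=8).tolist()}
```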

We create an instance of the `XGBRegressor` with some basic hyperparameters set, such as `n_estimators` and `learning_rate`. We then perform the grid search using `GridSearchCV`, providing the model, parameter grid, cross-validation object, scoring metric (negative mean squared error), and the number of CPU cores to use for parallel computation.

After fitting the grid search object with `grid_search.fit(X, y)`, we can access the best `reg_lambda` value and the corresponding best cross-validation mean squared error using `grid_search.best_params_` and `grid_search.best_score_`, respectively.
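
Beyond the single best score, `grid_search.cv_results_` records the mean, spread, and rank for every candidate. A small sketch that tabulates them with pandas (an extra dependency not used elsewhere in this example):

```
import pandas as pd

# One row per reg_lambda candidate: mean CV MSE, its spread, and rank
summary = pd.DataFrame(grid_search.cv_results_)[
    ['param_reg_lambda', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
summary['mean_test_score'] = -summary['mean_test_score']  # flip sign back to MSE
print(summary.sort_values('rank_test_score').to_string(index=False))
```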

Finally, we plot the relationship between the `reg_lambda` values and the cross-validation average MSE scores using matplotlib. We retrieve the results from `grid_search.cv_results_` and plot the mean MSE scores, with the standard deviation shown as a shaded band. We use a logarithmic scale for the x-axis to better visualize the range of `reg_lambda` values (note that the `reg_lambda=0` point cannot be shown on a log axis and is masked). This visualization helps us understand how the choice of `reg_lambda` affects the model’s performance and guides us in selecting an appropriate value.

By tuning the `reg_lambda` hyperparameter using grid search with cross-validation, we can find the optimal value that strikes a balance between model complexity and generalization performance. This helps prevent overfitting and ensures that the model performs well on unseen data.
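
As a final sanity check, you could hold out a test set and compare train versus test MSE for the tuned model; a small gap between the two suggests the regularization is doing its job. A minimal sketch, reusing the same synthetic data and the tuned `reg_lambda`:

```
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Refit with the tuned reg_lambda and compare train vs. test error
best = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1,
                        reg_lambda=grid_search.best_params_['reg_lambda'],
                        random_state=42)
best.fit(X_train, y_train)
print(f"Train MSE: {mean_squared_error(y_train, best.predict(X_train)):.4f}")
print(f"Test MSE:  {mean_squared_error(y_test, best.predict(X_test)):.4f}")
```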