Both the lambda and reg_lambda parameters in XGBoost control the L2 regularization term, which helps prevent overfitting by constraining the model's complexity.
The lambda parameter is preferred in the native XGBoost API, while reg_lambda is used in the scikit-learn API, conforming to the scikit-learn naming convention.
The lambda parameter cannot be passed directly as a keyword argument in the scikit-learn API: lambda is a reserved keyword in Python (it introduces an anonymous function), so writing lambda=1 in a function call raises a SyntaxError. Instead, lambda can be supplied by unpacking a dict of model parameters.
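As a quick illustration, here is a minimal sketch of the difference; the failing line is shown commented out so the snippet runs:
from xgboost import XGBClassifier
# XGBClassifier(lambda=1) would raise a SyntaxError, since lambda is a reserved keyword
model = XGBClassifier(**{'lambda': 1})  # dict unpacking sidesteps the keyword restriction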
The full example below demonstrates how to use both parameters with the scikit-learn API and confirms that they have the same effect on the model's predictions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Set up parameters for XGBoost
params_lambda = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'lambda': 1
}
params_reg_lambda = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'reg_lambda': 1
}
# Create two XGBoost classifiers, one using "lambda" and the other using "reg_lambda"
model_lambda = XGBClassifier(**params_lambda)
model_reg_lambda = XGBClassifier(**params_reg_lambda)
# Train both models on the training set
model_lambda.fit(X_train, y_train)
model_reg_lambda.fit(X_train, y_train)
# Make predictions on the test set
predictions_lambda = model_lambda.predict(X_test)
predictions_reg_lambda = model_reg_lambda.predict(X_test)
# Compare the results
assert (predictions_lambda == predictions_reg_lambda).all()
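To take the comparison one step further, a small follow-up sketch using scikit-learn's accuracy_score confirms that the two models also score identically on the test set:
from sklearn.metrics import accuracy_score
# Both models should report exactly the same accuracy
print(f"Accuracy (lambda):     {accuracy_score(y_test, predictions_lambda):.4f}")
print(f"Accuracy (reg_lambda): {accuracy_score(y_test, predictions_reg_lambda):.4f}")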
The example below demonstrates the same functionality using the native XGBoost API with DMatrix:
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Convert data to DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
# Set up parameters for XGBoost
params_lambda = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'lambda': 1
}
params_reg_lambda = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'reg_lambda': 1
}
# Train the models
model_lambda = xgb.train(params_lambda, dtrain, num_boost_round=10)
model_reg_lambda = xgb.train(params_reg_lambda, dtrain, num_boost_round=10)
# Predict probabilities on the test set and round at 0.5 to obtain class labels
predictions_lambda = model_lambda.predict(dtest).round()
predictions_reg_lambda = model_reg_lambda.predict(dtest).round()
# Compare the results
assert (predictions_lambda == predictions_reg_lambda).all()
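As an additional sanity check, a short sketch using Booster.save_config(), which dumps a booster's internal parameter configuration as JSON, should show that the two trained boosters carry the same configuration, since lambda and reg_lambda resolve to the same internal parameter:
# The internal configurations should match, since lambda and reg_lambda are aliases
print(model_lambda.save_config() == model_reg_lambda.save_config())  # expected: True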
The lambda and reg_lambda parameters serve the same purpose in XGBoost, controlling the L2 regularization term. A smaller value (e.g., 0.1) will allow the model to be more complex and potentially overfit, while a larger value (e.g., 10) will constrain the model's complexity and help prevent overfitting.
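To see this effect in practice, here is a small sketch (reusing the synthetic train/test split from the examples above) that compares train and test log loss for a weakly and a strongly regularized model:
from sklearn.metrics import log_loss
from xgboost import XGBClassifier
for reg in (0.1, 10):
    model = XGBClassifier(objective='binary:logistic', reg_lambda=reg, random_state=42)
    model.fit(X_train, y_train)
    train_ll = log_loss(y_train, model.predict_proba(X_train)[:, 1])
    test_ll = log_loss(y_test, model.predict_proba(X_test)[:, 1])
    # A heavier L2 penalty typically raises train loss but can shrink the train/test gap
    print(f"reg_lambda={reg}: train logloss={train_ll:.4f}, test logloss={test_ll:.4f}")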
The main difference between the two is the API in which they are used. The lambda parameter is used in the native XGBoost API, while reg_lambda is used in the scikit-learn API, conforming to the scikit-learn convention.
When working with XGBoost, it is recommended to use lambda with the native XGBoost API and reg_lambda with the scikit-learn API. The choice between lambda and reg_lambda ultimately depends on the API being used and personal preference.