
XGBoost Early Stopping With Grid Search

Combining early stopping with grid search in XGBoost is a powerful technique to automatically tune hyperparameters and prevent overfitting.

Grid search explores different hyperparameter combinations, while early stopping determines the optimal number of boosting rounds for each combination.

To perform a grid search while correctly using a validation set for early stopping in each cross-validation fold, we need to implement the grid search manually. GridSearchCV from scikit-learn does not support a separate, per-fold validation set for early stopping: it manages the data splitting internally, so any eval_set you pass in is fixed before the search begins, and the same data ends up serving both hyperparameter tuning and early stopping, which is not ideal for this use case.
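
For contrast, here is a minimal sketch of the GridSearchCV route and its limitation (assuming XGBoost >= 1.6, where early_stopping_rounds is an estimator parameter): GridSearchCV forwards fit parameters such as eval_set unchanged to every fold, so one fixed validation set is reused for early stopping throughout the entire search.

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split

# Create a synthetic dataset and hold out a single validation set
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# early_stopping_rounds in the constructor assumes XGBoost >= 1.6
model = xgb.XGBClassifier(n_estimators=100, early_stopping_rounds=10,
                          objective='binary:logistic', random_state=42)
grid = GridSearchCV(model, {'learning_rate': [0.01, 0.1], 'max_depth': [3, 6]}, cv=5)

# The same (X_val, y_val) is reused for early stopping in all five folds
grid.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(grid.best_params_)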

Below is an example that performs a manual grid search over a predefined set of hyperparameters, using a separate validation set for early stopping within each cross-validation fold. The example tunes two hyperparameters, learning_rate and max_depth:

import xgboost as xgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score

# Create a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=10, random_state=42)

# Configure cross-validation and early stopping
n_splits = 5
early_stopping_rounds = 10
kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Define hyperparameter grid
param_grid = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 6, 9]
}

# Perform cross-validation with early stopping
best_params = None
best_score = 0

for learning_rate in param_grid['learning_rate']:
    for max_depth in param_grid['max_depth']:
        test_scores = []
        for train_index, test_index in kf.split(X, y):
            X_train_fold, X_test_fold = X[train_index], X[test_index]
            y_train_fold, y_test_fold = y[train_index], y[test_index]

            # Split train fold into train and validation sets (stratified to keep class balance consistent)
            X_train_fold, X_val, y_train_fold, y_val = train_test_split(X_train_fold, y_train_fold, test_size=0.2, stratify=y_train_fold, random_state=42)

            # Prepare the model
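            # Note: passing early_stopping_rounds to the constructor requires XGBoost >= 1.6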
            model = xgb.XGBClassifier(n_estimators=100,
                learning_rate=learning_rate,
                max_depth=max_depth,
                early_stopping_rounds=early_stopping_rounds,
                objective='binary:logistic',
                random_state=42)

            # Fit model on train fold and use validation for early stopping
            model.fit(X_train_fold, y_train_fold, eval_set=[(X_val, y_val)], verbose=False)

            # Predict on test set
            y_pred_test = model.predict(X_test_fold)
            test_score = accuracy_score(y_test_fold, y_pred_test)
            test_scores.append(test_score)

        # Compute average score across all folds
        average_score = np.mean(test_scores)
        if average_score > best_score:
            best_score = average_score
            best_params = {'learning_rate': learning_rate, 'max_depth': max_depth}

print(f"Best Parameters: {best_params}")
print(f"Best CV Average Accuracy: {best_score}")

We begin by creating a synthetic binary classification dataset using make_classification from scikit-learn.

We configure the cross-validation and early stopping parameters, specifying the number of splits (n_splits) and the number of rounds to wait for improvement (early_stopping_rounds). We use StratifiedKFold to ensure that the class distribution is preserved in each fold.
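
To make the effect of stratification concrete, the following sketch prints the positive-class rate in each test fold; with StratifiedKFold it closely matches the rate in the full dataset (the variable names mirror the example above):

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print(f"Overall positive rate: {y.mean():.3f}")
for i, (train_index, test_index) in enumerate(kf.split(X, y)):
    # Each test fold preserves approximately the same class proportions
    print(f"Fold {i}: test positive rate = {y[test_index].mean():.3f}")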

We define the hyperparameter grid (param_grid) that specifies the combinations of hyperparameters to explore during the grid search. In this example, we include learning_rate and max_depth.
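
The nested loops in the example handle two hyperparameters, but they grow unwieldy as the grid expands. One way to generalize (a sketch using scikit-learn's ParameterGrid, which the example above does not use) is:

from sklearn.model_selection import ParameterGrid

param_grid = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 6, 9]
}

# ParameterGrid yields one dict per combination, replacing the nested loops
for params in ParameterGrid(param_grid):
    print(params)  # e.g. {'learning_rate': 0.01, 'max_depth': 3}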

We initialize variables to keep track of the best parameters and the best score.

We start a nested loop to iterate over each combination of hyperparameters in the grid. For each combination, we perform stratified k-fold cross-validation.

Inside the cross-validation loop, we split the data into train and test folds based on the indices provided by StratifiedKFold. We then split the train fold further into a training set and a validation set using train_test_split (stratified, to keep the class balance consistent). This validation set is used only for early stopping.
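
Concretely, with 1,000 samples and five folds, each iteration sees about 200 test samples, and the remaining 800 are split 80/20 into 640 training and 160 validation samples. A quick check of the three-way split, continuing from the variables in the example above:

# Inside the cross-validation loop of the example above
print(X_train_fold.shape, X_val.shape, X_test_fold.shape)
# Expected output per fold: (640, 20) (160, 20) (200, 20)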

We create an instance of the XGBClassifier with the current hyperparameter values (learning_rate and max_depth) and set the early stopping rounds.

We fit the model on the training fold using model.fit(), specifying the validation set (X_val, y_val) for early stopping via the eval_set parameter. The model will monitor the performance on the validation set and stop training if no improvement is observed for early_stopping_rounds consecutive rounds.
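
After fitting, the estimator records where early stopping halted. Here is a self-contained sketch of inspecting these attributes (best_iteration and best_score, per the XGBoost scikit-learn API; the validation metric defaults to log loss for binary classification):

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(n_estimators=100, early_stopping_rounds=10,
                          objective='binary:logistic', random_state=42)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

# best_iteration is the (zero-based) round with the best validation score
print(f"Best iteration: {model.best_iteration}")
print(f"Best validation log loss: {model.best_score}")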

After training, we predict on the test fold using model.predict() and calculate the accuracy score using accuracy_score from scikit-learn. We append the accuracy score to the test_scores list.

After the cross-validation loop finishes for the current hyperparameter combination, we compute the average accuracy score across all folds using np.mean(test_scores).

We compare the average score with the current best score. If the average score is higher, we update the best score and the corresponding best parameters.

Finally, we print the best parameters and the best cross-validation average accuracy.

By combining grid search with stratified k-fold cross-validation and early stopping, we can explore different hyperparameter combinations and find the best set of hyperparameters that maximize the model’s performance. The grid search iterates over each combination of hyperparameters, and for each combination, cross-validation with early stopping is performed to assess the model’s performance and prevent overfitting.

This approach allows us to tune the hyperparameters of the XGBoost model in a more efficient and automated manner, helping to find the optimal configuration that generalizes well to unseen data.
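
As a final step, once the best parameters are known, it is common to refit a single model on as much data as possible. Below is a sketch of that refit, reusing X, y, best_params, and early_stopping_rounds from the example above and holding out a small validation set purely for early stopping:

# Refit a final model with the best hyperparameters found by the search
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  stratify=y, random_state=42)

final_model = xgb.XGBClassifier(n_estimators=100,
                                early_stopping_rounds=early_stopping_rounds,
                                objective='binary:logistic',
                                random_state=42,
                                **best_params)
final_model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(f"Final model stopped at iteration {final_model.best_iteration}")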


