XGBoost Early Stopping With Random Search

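Random search is an effective technique for exploring a wide range of hyperparameter combinations in XGBoost. Cross-validation gives a more reliable estimate of how each sampled configuration performs, and early stopping limits overfitting by halting boosting once a held-out validation score stops improving. Together they identify the strongest configuration among those sampled.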

However, the RandomizedSearchCV class from Scikit-Learn has no built-in way to carve a separate validation set out of each cross-validation training fold for early stopping. An eval_set passed through its fit method is a single fixed set that is reused unchanged for every fold. To implement random search with per-fold early stopping correctly, we need to perform the search manually.

The following example demonstrates how to conduct a manual random search over XGBoost hyperparameters, including learning rate, max depth, subsample, and colsample_bytree, while using a validation set for early stopping in each fold of the cross-validation. We’ll use a synthetic classification dataset for this illustration.

import xgboost as xgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score
from scipy.stats import uniform

# Create a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=10, random_state=42)

# Configure cross-validation and early stopping
n_splits = 5
early_stopping_rounds = 10
kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Define hyperparameter distributions for random search
param_distributions = {
    'learning_rate': ('uniform', 0.01, 0.3),
    'max_depth': ('choice', [3, 6, 9, 12]),
    'subsample': ('uniform', 0.5, 1.0),
    'colsample_bytree': ('uniform', 0.4, 1.0)
}

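# Seeded generator so the sampled configurations are reproducible
rng = np.random.default_rng(42)

# Function to sample parameters based on their distribution type
def sample_param(distribution):
    if distribution[0] == 'uniform':
        low, high = distribution[1], distribution[2]
        # scipy's uniform takes (loc, scale), so the scale is the interval width
        return float(uniform(low, high - low).rvs(random_state=rng))
    elif distribution[0] == 'choice':
        options = distribution[1]
        return options[rng.integers(len(options))]
    else:
        raise ValueError(f"Unsupported distribution type: {distribution[0]}")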

# Perform random search with early stopping
n_iterations = 20
best_params = None
best_score = 0

for _ in range(n_iterations):
    test_scores = []
    # Draw one candidate configuration for this iteration
    params = {k: sample_param(v) for k, v in param_distributions.items()}

    for train_index, test_index in kf.split(X, y):
        X_train_fold, X_test_fold = X[train_index], X[test_index]
        y_train_fold, y_test_fold = y[train_index], y[test_index]

        # Split the training fold into train and validation sets, keeping class proportions
        X_train_fold, X_val, y_train_fold, y_val = train_test_split(
            X_train_fold, y_train_fold, test_size=0.2, stratify=y_train_fold, random_state=42)

        # Prepare the model
        model = xgb.XGBClassifier(n_estimators=100,
                                  learning_rate=params['learning_rate'],
                                  max_depth=int(params['max_depth']),  # max_depth should be an int
                                  subsample=params['subsample'],
                                  colsample_bytree=params['colsample_bytree'],
                                  objective='binary:logistic',
                                  random_state=42,
                                  early_stopping_rounds=early_stopping_rounds)  # stop once the validation metric stalls

        # Fit model on train fold and use validation for early stopping
        model.fit(X_train_fold, y_train_fold, eval_set=[(X_val, y_val)], verbose=False)

        # Predict on test set
        y_pred_test = model.predict(X_test_fold)
        test_score = accuracy_score(y_test_fold, y_pred_test)
        test_scores.append(test_score)

    # Compute average score across all folds
    average_score = np.mean(test_scores)
    if average_score > best_score:
        best_score = average_score
        best_params = params

print(f"Best Parameters: {best_params}")
print(f"Best CV Average Accuracy: {best_score}")

We start by creating a synthetic binary classification dataset using make_classification from Scikit-Learn.

Next, we set up the cross-validation and early stopping parameters. We specify the number of splits (n_splits) and the number of rounds to wait for improvement (early_stopping_rounds). We use StratifiedKFold to ensure that the class distribution is preserved in each fold.
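To make the effect of stratification concrete, here is a minimal sketch on a deliberately imbalanced label vector. The arrays y_demo and X_demo are made up for illustration and are not part of the example above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 900 negatives and 100 positives
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect how folds are formed

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Fraction of positives in each test fold
fold_positive_rates = [y_demo[test_idx].mean() for _, test_idx in skf.split(X_demo, y_demo)]

for i, rate in enumerate(fold_positive_rates):
    print(f"Fold {i}: positive rate = {rate:.2f}")
```

Each test fold should show the same share of positives as the full label vector, which is what keeps per-fold accuracy comparable.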

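We define the distribution and range that each hyperparameter is sampled from: learning_rate, subsample, and colsample_bytree are drawn uniformly from a continuous interval, and max_depth is chosen from a fixed list of values.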
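The continuous ranges are written as (low, high) pairs, while scipy.stats.uniform is parameterized by a location and a width. The short sketch below shows why the width has to be high minus low. The names correct and mistaken are illustrative only.

```python
from scipy.stats import uniform

low, high = 0.01, 0.3

# scipy parameterizes uniform as (loc, scale): the support is [loc, loc + scale]
correct = uniform(loc=low, scale=high - low)
mistaken = uniform(loc=low, scale=high)  # upper bound passed as the scale by mistake

correct_bounds = correct.support()
mistaken_bounds = mistaken.support()
print("intended range:", correct_bounds)
print("range when the upper bound is passed as scale:", mistaken_bounds)

# Draw reproducible samples from the intended range
samples = correct.rvs(size=1000, random_state=0)
print("sample min/max:", samples.min(), samples.max())
```

This is why sample_param passes distribution[2] - distribution[1] as the second argument.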

We initialize variables to track the best parameters and the best score.

We start a loop that runs for a specified number of iterations (n_iterations). In each iteration, we sample a set of hyperparameters from the defined distributions using our custom sample_param function.

Inside the iteration, we perform stratified k-fold cross-validation. We split the data into train and test folds based on the indices provided by StratifiedKFold. We further split the train fold into a training set and a validation set using train_test_split. This validation set will be used for early stopping.
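The three-way split inside one fold can be checked in isolation. The sketch below uses hypothetical row ids (X_idx) in place of real features so that the sizes and the absence of overlap are easy to verify. It mirrors the 5-fold, 20% validation setup of the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical balanced labels; X_idx holds row ids so overlap is easy to check
y_demo = np.array([0, 1] * 500)
X_idx = np.arange(len(y_demo))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_index, test_index = next(skf.split(X_idx, y_demo))

# Carve the early-stopping validation set out of the training fold only
fit_ids, val_ids = train_test_split(X_idx[train_index], test_size=0.2, random_state=42)
test_ids = X_idx[test_index]

print(len(fit_ids), len(val_ids), len(test_ids))

# No row appears in more than one of the three subsets
no_overlap = (
    set(fit_ids).isdisjoint(val_ids)
    and set(fit_ids).isdisjoint(test_ids)
    and set(val_ids).isdisjoint(test_ids)
)
print("subsets are disjoint:", no_overlap)
```

Because the validation rows come only from the training fold, the test fold never influences when training stops.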

We create an instance of the XGBClassifier with the sampled hyperparameter values and set the early stopping rounds.

We fit the model on the training portion of the fold using model.fit(), specifying the validation set (X_val, y_val) for early stopping via the eval_set parameter. The model stops adding trees once the validation metric has not improved for early_stopping_rounds consecutive rounds, or when it reaches the n_estimators limit, whichever comes first.
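The stopping rule itself is simple enough to sketch in plain Python. The function below is a simplified illustration of a patience rule for a metric that should decrease, such as log loss. It is not XGBoost's own implementation, and both the function name early_stop and the loss values are made up.

```python
def early_stop(val_losses, patience):
    """Return (best_round, last_round_run) for a metric that should decrease."""
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the wait
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here
            return best_round, round_idx
    # Ran out of rounds before the patience was exhausted
    return best_round, len(val_losses) - 1


# Hypothetical validation losses, one per boosting round
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.45, 0.46, 0.47]

result_patience_3 = early_stop(losses, patience=3)
result_patience_2 = early_stop(losses, patience=2)
print(result_patience_3, result_patience_2)
```

A smaller patience stops sooner but can miss a later improvement, as the two calls on the same loss sequence show.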

After training, we predict on the test fold using model.predict() and calculate the accuracy score using accuracy_score from Scikit-Learn. We append the accuracy score to the test_scores list.

After the cross-validation loop finishes for the current iteration, we compute the average accuracy score across all folds using np.mean(test_scores).
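Alongside the mean, the spread of the fold scores indicates how stable a configuration is. The sketch below uses hypothetical fold accuracies for illustration only. Reporting the standard deviation is an addition to the example above, not something it computes.

```python
import numpy as np

# Hypothetical per-fold accuracies, for illustration only
fold_scores = [0.90, 0.88, 0.92, 0.89, 0.91]

mean_score = np.mean(fold_scores)
std_score = np.std(fold_scores)  # population standard deviation across the folds
print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

Two configurations whose means differ by less than this spread are hard to tell apart with confidence.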

We compare the average score with the current best score. If the average score is higher, we update the best score and the corresponding best parameters.

Finally, we print the best parameters and the best cross-validation average accuracy.

By combining random search with stratified k-fold cross-validation and early stopping, we can explore a wide range of hyperparameter combinations and select the sampled configuration with the highest average cross-validation accuracy. For each configuration, the validation set inside every fold decides when boosting stops, and the untouched test fold measures how well the resulting model performs.

This approach automates much of the tuning of an XGBoost model and helps find a configuration that generalizes well to unseen data. Because random search only evaluates the configurations it samples, increasing n_iterations widens the search at the cost of more training time.
