Random search is an effective technique for exploring a wide range of hyperparameter combinations in XGBoost. When combined with early stopping and cross-validation, it helps limit overfitting and identifies the best-performing configuration among those sampled.

However, the `RandomizedSearchCV` class from Scikit-Learn does not inherently support using a separate validation set for early stopping within each cross-validation fold. To correctly implement random search with early stopping, we need to perform the search manually.

The following example demonstrates how to conduct a manual random search over XGBoost hyperparameters, including learning rate, max depth, subsample, and colsample_bytree, while using a validation set for early stopping in each fold of the cross-validation. We’ll use a synthetic classification dataset for this illustration.

```
import xgboost as xgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score
from scipy.stats import uniform

# Create a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=10, random_state=42)

# Configure cross-validation and early stopping
n_splits = 5
early_stopping_rounds = 10
kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Seeded generator so the sampled hyperparameters are reproducible
rng = np.random.default_rng(42)

# Define hyperparameter distributions for random search
param_distributions = {
    'learning_rate': ('uniform', 0.01, 0.3),
    'max_depth': ('choice', [3, 6, 9, 12]),
    'subsample': ('uniform', 0.5, 1.0),
    'colsample_bytree': ('uniform', 0.4, 1.0)
}

# Function to sample parameters based on their distribution type
def sample_param(distribution):
    if distribution[0] == 'uniform':
        # scipy's uniform takes (loc, scale), i.e. (low, high - low)
        low, high = distribution[1], distribution[2]
        return float(uniform(low, high - low).rvs(random_state=rng))
    elif distribution[0] == 'choice':
        return rng.choice(distribution[1])
    else:
        raise ValueError(f"Unsupported distribution type: {distribution[0]}")

# Perform random search with early stopping
n_iterations = 20
best_params = None
best_score = 0
for _ in range(n_iterations):
    test_scores = []
    params = {k: sample_param(v) for k, v in param_distributions.items()}
    for train_index, test_index in kf.split(X, y):
        X_train_fold, X_test_fold = X[train_index], X[test_index]
        y_train_fold, y_test_fold = y[train_index], y[test_index]
        # Split train set into train and validation
        X_train_fold, X_val, y_train_fold, y_val = train_test_split(
            X_train_fold, y_train_fold, test_size=0.2, random_state=42)
        # Prepare the model
        model = xgb.XGBClassifier(
            n_estimators=100,
            learning_rate=params['learning_rate'],
            max_depth=int(params['max_depth']),  # max_depth should be an int
            subsample=params['subsample'],
            colsample_bytree=params['colsample_bytree'],
            objective='binary:logistic',
            random_state=42,
            early_stopping_rounds=early_stopping_rounds)  # set in the constructor (XGBoost 1.6+)
        # Fit model on train fold and use validation for early stopping
        model.fit(X_train_fold, y_train_fold, eval_set=[(X_val, y_val)], verbose=False)
        # Predict on the held-out test fold, which was not used for fitting or early stopping
        y_pred_test = model.predict(X_test_fold)
        test_score = accuracy_score(y_test_fold, y_pred_test)
        test_scores.append(test_score)
    # Compute average score across all folds
    average_score = np.mean(test_scores)
    if average_score > best_score:
        best_score = average_score
        best_params = params

print(f"Best Parameters: {best_params}")
print(f"Best CV Average Accuracy: {best_score}")
```

We start by creating a synthetic binary classification dataset using `make_classification` from Scikit-Learn.

Next, we set up the cross-validation and early stopping parameters. We specify the number of splits (`n_splits`) and the number of rounds to wait for improvement (`early_stopping_rounds`). We use `StratifiedKFold` to ensure that the class distribution is preserved in each fold.

We define the hyperparameter distributions for random search as tuples: `('uniform', low, high)` for a continuous range and `('choice', options)` for a discrete set of values.

We initialize variables to track the best parameters and the best score.

We start a loop that runs for a specified number of iterations (`n_iterations`). In each iteration, we sample a set of hyperparameters from the defined distributions using our custom `sample_param` function.
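The sampling step can also be sketched on its own. The snippet below is a minimal illustration using only the standard library, assuming the same tuple format as the search space above; `sample_params` is a hypothetical helper (not part of XGBoost or Scikit-Learn), and passing it a seeded `random.Random` is one way to make the sampled configurations repeatable between runs.

```python
import random

# Search space in the same (kind, ...) tuple format as the example above
param_distributions = {
    "learning_rate": ("uniform", 0.01, 0.3),
    "max_depth": ("choice", [3, 6, 9, 12]),
    "subsample": ("uniform", 0.5, 1.0),
    "colsample_bytree": ("uniform", 0.4, 1.0),
}

def sample_params(distributions, rng):
    """Draw one configuration (hypothetical helper); rng is a seeded random.Random."""
    sampled = {}
    for name, spec in distributions.items():
        kind = spec[0]
        if kind == "uniform":
            _, low, high = spec
            sampled[name] = rng.uniform(low, high)
        elif kind == "choice":
            sampled[name] = rng.choice(spec[1])
        else:
            raise ValueError(f"Unsupported distribution type: {kind}")
    return sampled

# The same seed always yields the same sequence of configurations
rng = random.Random(42)
candidates = [sample_params(param_distributions, rng) for _ in range(3)]
for candidate in candidates:
    print(candidate)
```

Drawing every value from a single seeded generator means a search can be rerun and compared against an earlier one, which is hard to do when each call samples from global random state.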

Inside the iteration, we perform stratified k-fold cross-validation. We split the data into train and test folds based on the indices provided by `StratifiedKFold`. We further split the train fold into a training set and a validation set using `train_test_split`. This validation set will be used for early stopping.
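To make the nested split concrete, here is a small sketch of the three-way partition on toy data. The arrays are illustrative placeholders, and the `stratify` argument is an optional addition that is not part of the example above; it keeps the class balance of the validation set in line with the training fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy data: 1000 samples with perfectly balanced binary labels (illustrative only)
X = np.arange(2000, dtype=float).reshape(1000, 2)
y = np.arange(1000) % 2

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
split_sizes = []
for train_index, test_index in kf.split(X, y):
    X_train_fold, y_train_fold = X[train_index], y[train_index]
    # Carve the early-stopping validation set out of the training fold only,
    # so the test fold never influences training or the stopping point
    X_fit, X_val, y_fit, y_val = train_test_split(
        X_train_fold, y_train_fold, test_size=0.2,
        stratify=y_train_fold, random_state=42)
    split_sizes.append((len(X_fit), len(X_val), len(test_index)))

print(split_sizes[0])
```

With 1000 samples and five folds, each iteration fits on 640 samples, monitors early stopping on 160, and scores on the remaining 200.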

We create an instance of the `XGBClassifier` with the sampled hyperparameter values and set the early stopping rounds.

We fit the model on the training portion of the fold using `model.fit()`, specifying the validation set (`X_val`, `y_val`) for early stopping via the `eval_set` parameter. Training stops if the evaluation metric on the validation set does not improve for `early_stopping_rounds` consecutive rounds.
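The stopping rule itself is simple enough to sketch in plain Python. The function below is a hypothetical illustration of a patience-based rule for a lower-is-better metric such as log loss; it is not XGBoost's internal implementation, and the loss values are made up for the example.

```python
def find_stopping_point(val_losses, patience):
    """Return (best_round, stop_round) for a lower-is-better validation metric."""
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best score: remember it and reset the patience window
            best_loss = loss
            best_round = round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here
            return best_round, round_idx
    # Ran out of rounds before the patience window was exhausted
    return best_round, len(val_losses) - 1

# Made-up validation losses, one per boosting round
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.45, 0.46, 0.47]
best_round, stop_round = find_stopping_point(losses, patience=3)
print(best_round, stop_round)  # 5 8
```

Note that the brief plateau at rounds 3 and 4 does not trigger a stop, because round 5 improves on the best score before the patience window runs out.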

After training, we predict on the test fold using `model.predict()` and calculate the accuracy score using `accuracy_score` from Scikit-Learn. We append the accuracy score to the `test_scores` list.

After the cross-validation loop finishes for the current iteration, we compute the average accuracy score across all folds using `np.mean(test_scores)`.

We compare the average score with the current best score. If the average score is higher, we update the best score and the corresponding best parameters.

Finally, we print the best parameters and the best cross-validation average accuracy.

By combining random search with stratified k-fold cross-validation and early stopping, we can efficiently explore a wide range of hyperparameter combinations and find the sampled configuration with the highest cross-validated accuracy. For each configuration, early stopping limits the number of boosting rounds in every fold, while the held-out test fold provides a performance estimate that was not used to choose the stopping point.

This approach allows us to tune the hyperparameters of the XGBoost model in a more automated and efficient manner, helping to find a configuration that is likely to generalize well to unseen data.