Random Search XGBoost Hyperparameters

Random search is an alternative to grid search for finding optimal XGBoost hyperparameters.

Instead of exhaustively searching through a predefined grid, random search samples hyperparameter values randomly from a specified distribution.

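This can be more efficient, especially with large hyperparameter spaces, because the number of model fits is set by a fixed sampling budget rather than growing multiplicatively with every value added to the grid.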

Here’s how to perform random search for XGBoost using scikit-learn:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from xgboost import XGBClassifier
from scipy.stats import uniform

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter distributions
param_dist = {
    'max_depth': [3, 5, 7, 9, 11],
    'min_child_weight': [1, 3, 5, 7],
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4),
    'learning_rate': uniform(0.01, 0.29)
}

# Create XGBoost classifier
xgb = XGBClassifier(n_estimators=100, objective='binary:logistic', random_state=42)

# Perform random search
random_search = RandomizedSearchCV(
    estimator=xgb,
    param_distributions=param_dist,
    n_iter=50,
    cv=3,
    n_jobs=-1,
    verbose=2,
    random_state=42,
)
random_search.fit(X_train, y_train)

# Print best parameters
print(f"Best parameters: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_}")

In this example:

We load the breast cancer dataset and split it into train and test sets.
We define the parameter distributions in param_dist. For max_depth and min_child_weight, we provide a list of discrete values to sample from. For subsample, colsample_bytree, and learning_rate, we use scipy.stats.uniform to define a continuous distribution to sample from. The first argument (loc) is the lower bound and the second (scale) is the width of the range (upper bound - lower bound), so uniform(0.6, 0.4) samples from 0.6 to 1.0 and uniform(0.01, 0.29) samples from 0.01 to 0.30.
We create an XGBoost classifier xgb.
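We create a RandomizedSearchCV object random_search, specifying the classifier, parameter distributions, number of sampled configurations (n_iter), and number of cross-validation folds (cv). Setting random_state makes the sampled configurations reproducible, n_jobs=-1 uses all available CPU cores, and verbose=2 prints progress for each fit.
We fit random_search to the training data. This samples 50 configurations from param_dist and evaluates each with 3-fold cross-validation, for 150 model fits in total, followed by one final refit of the best configuration.
We print the best parameters and the corresponding best score, which is the mean cross-validated score of the best configuration (accuracy by default for a classifier).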
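
The (lower bound, range) parameterization of scipy.stats.uniform is easy to misread as (low, high), so a quick check of the sampled values can help. The following is a minimal sketch that draws from the same distributions used above; the variable names are illustrative and not part of the original example.

```python
from scipy.stats import uniform

# uniform(loc, scale) takes the lower bound and the width of the range,
# not (low, high): values fall between loc and loc + scale.
subsample_dist = uniform(loc=0.6, scale=0.4)        # 0.6 to 1.0
learning_rate_dist = uniform(loc=0.01, scale=0.29)  # 0.01 to 0.30

# Draw samples with a fixed seed so the check is reproducible
subsample_draws = subsample_dist.rvs(size=1000, random_state=42)
learning_rate_draws = learning_rate_dist.rvs(size=1000, random_state=42)

print(f"subsample range: {subsample_draws.min():.3f} to {subsample_draws.max():.3f}")
print(f"learning_rate range: {learning_rate_draws.min():.3f} to {learning_rate_draws.max():.3f}")
```

Had the distribution been written as uniform(0.6, 1.0), the samples would range up to 1.6, which is outside the valid range for subsample and colsample_bytree.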

Random search can be a good choice when you have a large hyperparameter space and limited computational resources. It allows you to explore a wide range of values without exhaustively searching through all possible combinations. The number of iterations n_iter controls the number of random configurations to try.
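
To see what n_iter means in practice, it can help to preview the sampled configurations without fitting any models. The sketch below uses scikit-learn's ParameterSampler with the same param_dist as above and a small n_iter of 5 chosen only for display; it is meant as an illustration of the sampling step, not as a replacement for RandomizedSearchCV.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# Same distributions as in the example above
param_dist = {
    'max_depth': [3, 5, 7, 9, 11],
    'min_child_weight': [1, 3, 5, 7],
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4),
    'learning_rate': uniform(0.01, 0.29),
}

# Each element is one randomly sampled configuration (a dict of parameter values).
# n_iter=5 is used here only to keep the printout short.
sampled_configs = list(ParameterSampler(param_dist, n_iter=5, random_state=42))

for i, params in enumerate(sampled_configs, start=1):
    print(f"Configuration {i}: {params}")
```

Each printed dictionary corresponds to one candidate model, so raising n_iter increases the number of candidates (and the compute cost) linearly.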

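As with grid search, you can use the best parameters found by random search to train your final model on the full training set and evaluate its performance on the test set. Because RandomizedSearchCV uses refit=True by default, the best configuration is already retrained on the full training set and is available as random_search.best_estimator_.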
