XGBoost is a powerful algorithm, but its performance depends heavily on its hyperparameters.
Grid search is a systematic way to find a good combination of hyperparameters by exhaustively evaluating every combination in a specified parameter grid.
Here’s how to perform grid search for XGBoost using scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define parameter grid
param_grid = {
'max_depth': [3, 5, 7],
'min_child_weight': [1, 3, 5],
'subsample': [0.6, 0.8, 1.0],
'colsample_bytree': [0.6, 0.8, 1.0],
'learning_rate': [0.01, 0.1, 0.3]
}
# Create XGBoost classifier
xgb = XGBClassifier(n_estimators=100, objective='binary:logistic', random_state=42)
# Perform grid search
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
# Print best parameters
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")
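Before launching a search like this, it is worth estimating its cost: the grid above has five parameters with three values each, so 3^5 = 243 combinations, and with cv=3 each combination is fit three times. A quick sketch of that arithmetic using scikit-learn's ParameterGrid helper:

```python
from sklearn.model_selection import ParameterGrid

# The same grid as above
param_grid = {
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'learning_rate': [0.01, 0.1, 0.3]
}

# ParameterGrid enumerates every combination GridSearchCV will try
n_candidates = len(ParameterGrid(param_grid))  # 3**5 = 243 combinations
n_fits = n_candidates * 3                      # times cv=3 folds
print(n_candidates, n_fits)  # 243 729
```

That is 729 model fits (plus one final refit of the best candidate), which is why n_jobs=-1 matters here.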
In this example:

- We load the breast cancer dataset from scikit-learn and split it into train and test sets.
- We define a parameter grid, param_grid, with the hyperparameters we want to tune. Here we include max_depth, min_child_weight, subsample, colsample_bytree, and learning_rate, but you can add or remove parameters based on your needs.
- We create an instance of the XGBoost classifier, XGBClassifier, with some basic parameters.
- We create a GridSearchCV object, grid_search, passing in the XGBoost classifier, the parameter grid, and the desired number of cross-validation splits (cv). Setting n_jobs=-1 uses all available CPU cores to parallelize the search.
- We fit grid_search to the training data. This trains and evaluates a model for each combination of parameters in the grid.
- Finally, we print the best parameters and the corresponding best score (the mean cross-validated score of the best estimator).
Grid search can be computationally expensive, especially with a large dataset or an extensive parameter grid. However, it’s a reliable way to find the best hyperparameters for your XGBoost model. You can use the best parameters found by grid search to train your final model on the full training set and evaluate its performance on the test set.