XGBoost is a powerful algorithm, but its performance depends heavily on its hyperparameters.
Grid search is a systematic way to find a good combination of hyperparameters by exhaustively evaluating every combination in a specified parameter grid.
Here’s how to perform grid search for XGBoost using scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define parameter grid
param_grid = {
'max_depth': [3, 5, 7],
'min_child_weight': [1, 3, 5],
'subsample': [0.6, 0.8, 1.0],
'colsample_bytree': [0.6, 0.8, 1.0],
'learning_rate': [0.01, 0.1, 0.3]
}
# Create XGBoost classifier
xgb = XGBClassifier(n_estimators=100, objective='binary:logistic', random_state=42)
# Perform grid search
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
# Print best parameters
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")
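Before launching a search like this, it is worth estimating its cost: the grid above has five parameters with three values each, so 3^5 = 243 combinations, and with cv=3 each combination is fit three times. A quick sketch of that arithmetic using scikit-learn's ParameterGrid helper:

```python
from sklearn.model_selection import ParameterGrid

# The same grid as above
param_grid = {
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'learning_rate': [0.01, 0.1, 0.3]
}

# ParameterGrid enumerates every combination GridSearchCV will try
n_candidates = len(ParameterGrid(param_grid))  # 3**5 = 243 combinations
n_fits = n_candidates * 3                      # times cv=3 folds
print(n_candidates, n_fits)  # 243 729
```

That is 729 model fits (plus one final refit of the best candidate), which is why n_jobs=-1 matters here.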
In this example:

- We load the breast cancer dataset from scikit-learn and split it into train and test sets.
- We define a parameter grid, param_grid, with the hyperparameters we want to tune. Here we include max_depth, min_child_weight, subsample, colsample_bytree, and learning_rate, but you can add or remove parameters based on your needs.
- We create an instance of the XGBoost classifier, XGBClassifier, with some basic parameters.
- We create a GridSearchCV object, grid_search, passing in the XGBoost classifier, the parameter grid, and the desired number of cross-validation splits (cv). Setting n_jobs=-1 uses all available CPU cores to parallelize the search.
- We fit grid_search to the training data. This trains and evaluates a model for each combination of parameters in the grid.
- Finally, we print the best parameters and the corresponding best score (the mean cross-validated score of the best estimator).
Grid search can be computationally expensive, especially with a large dataset or an extensive parameter grid. However, it’s a reliable way to find the best hyperparameters for your XGBoost model. You can use the best parameters found by grid search to train your final model on the full training set and evaluate its performance on the test set.