When working with XGBoost, you often need to tune the model’s hyperparameters to get good performance.
Scikit-learn’s GridSearchCV lets you define a grid of hyperparameters, evaluate every combination exhaustively with cross-validation, and access the best model it found.
This example demonstrates how to save and load the best model from a GridSearchCV run.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define parameter grid
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200]
}
# Create XGBClassifier
model = XGBClassifier(random_state=42)
# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)
# Print best score and parameters
print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")
# Access best model
best_model = grid_search.best_estimator_
# Save best model
best_model.save_model('best_model.ubj')
# Load saved model
loaded_model = XGBClassifier()
loaded_model.load_model('best_model.ubj')
# Use loaded model for predictions
predictions = loaded_model.predict(X_test)
# Print accuracy score
accuracy = loaded_model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.3f}")
In this example, we first generate a synthetic binary classification dataset using scikit-learn’s make_classification function and split it into train and test sets.
We then define a parameter grid containing three candidate values each for max_depth, learning_rate, and n_estimators, which gives 27 combinations. The grid is passed to GridSearchCV along with the XGBClassifier instance, specifying 3-fold cross-validation, so the search fits 27 × 3 = 81 models during cross-validation.
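To make the cost of the search concrete, the short sketch below counts the candidates in the same grid with scikit-learn’s ParameterGrid helper. The names cv_folds, n_candidates, and n_fits are illustrative and do not appear in the example above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the example above
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
}

cv_folds = 3  # illustrative name; matches cv=3 in the example

# ParameterGrid enumerates every combination that GridSearchCV would try
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * cv_folds

print(f"Candidates: {n_candidates}, cross-validation fits: {n_fits}")
```

Checking this count before launching a search is a cheap way to see how quickly the grid grows when another hyperparameter or value is added.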
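After fitting the GridSearchCV object on the training data, we print the best score, which is the mean cross-validated accuracy of the winning combination, and the corresponding hyperparameters. We access the best model through the best_estimator_ attribute, which GridSearchCV refits on the full training set by default, and save it to a file named ‘best_model.ubj’ using the save_model method.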
To demonstrate loading the saved model, we create a new XGBClassifier instance and load the saved model using the load_model method. We then use this loaded model to make predictions on the test set and print the accuracy score.
By following this approach, you can save the best model found by a GridSearchCV run and reuse it later without repeating the search.