When dealing with imbalanced classification problems, it’s essential to ensure that the class distribution is preserved in each fold during cross-validation.
Stratified k-fold cross-validation addresses this issue by maintaining the same class proportions in each fold as in the original dataset.
This example demonstrates how to use stratified k-fold cross-validation with XGBoost for evaluating model performance on an imbalanced classification dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from xgboost import XGBClassifier
import numpy as np
# Generate an imbalanced binary classification dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)
# Create an XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
# Create a StratifiedKFold object
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Perform stratified k-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=cv, scoring='f1')
# Print the cross-validation scores
print("Cross-validation scores:", cv_scores)
print(f"Mean cross-validation score: {np.mean(cv_scores):.2f} +/- {np.std(cv_scores):.2f}")
Here’s a breakdown of the example:
- We generate a synthetic imbalanced binary classification dataset using scikit-learn's `make_classification` function, specifying the desired class weights (roughly 90% majority class, 10% minority class).
- We create an `XGBClassifier` with specified hyperparameters.
- We instantiate a `StratifiedKFold` object, specifying the number of splits, whether to shuffle the data, and a random state for reproducibility (a quick check that each fold preserves the class balance follows this list).
- We use `cross_val_score` to perform stratified k-fold cross-validation, passing the model, the input features (`X`), the target variable (`y`), the `StratifiedKFold` object (`cv`), and the scoring metric (F1 score).
- We print the individual cross-validation scores along with their mean and standard deviation.
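To confirm that stratification is working as intended, you can inspect the class balance inside each fold yourself. The following sketch reuses the `X`, `y`, and `cv` objects defined above and prints the minority-class fraction in each training and test split (since `y` contains 0s and 1s, the mean of a slice of `y` is the fraction of minority-class samples in it):

# Inspect the class balance in each fold to verify stratification
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    train_minority = np.mean(y[train_idx])
    test_minority = np.mean(y[test_idx])
    print(f"Fold {fold}: train minority fraction={train_minority:.3f}, "
          f"test minority fraction={test_minority:.3f}")

With stratification, every fold should report a minority fraction close to the dataset-wide value of about 0.10.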
By using stratified k-fold cross-validation, we ensure that the model's performance is evaluated on a representative sample of the imbalanced dataset in each fold. This provides a more accurate estimate of the model's performance than regular k-fold cross-validation, which may produce folds whose class distributions differ significantly from the original dataset's.
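To see that difference in practice, you can repeat the same evaluation with scikit-learn's plain `KFold` splitter and compare the spread of the scores. This is a minimal sketch, reusing the model, data, and `cv_scores` from above (the `plain_cv` and `plain_scores` names are just illustrative); with a rarer minority class or a smaller dataset, the per-fold class proportions, and hence the scores, typically vary more without stratification:

from sklearn.model_selection import KFold

# Repeat the evaluation with a non-stratified splitter for comparison
plain_cv = KFold(n_splits=5, shuffle=True, random_state=42)
plain_scores = cross_val_score(model, X, y, cv=plain_cv, scoring='f1')
print(f"Stratified k-fold F1: {np.mean(cv_scores):.2f} +/- {np.std(cv_scores):.2f}")
print(f"Regular k-fold F1:    {np.mean(plain_scores):.2f} +/- {np.std(plain_scores):.2f}")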