When dealing with imbalanced classification problems, it’s essential to ensure that the class distribution is preserved in each fold during cross-validation.
Stratified k-fold cross-validation addresses this issue by maintaining the same class proportions in each fold as in the original dataset.
This example demonstrates how to use stratified k-fold cross-validation with XGBoost for evaluating model performance on an imbalanced classification dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from xgboost import XGBClassifier
import numpy as np
# Generate an imbalanced binary classification dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)
# Create an XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
# Create a StratifiedKFold object
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Perform stratified k-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=cv, scoring='f1')
# Print the cross-validation scores
print("Cross-validation scores:", cv_scores)
print(f"Mean cross-validation score: {np.mean(cv_scores):.2f} +/- {np.std(cv_scores):.2f}")
Here’s a breakdown of the example:
- We generate a synthetic imbalanced binary classification dataset using scikit-learn's `make_classification` function, specifying the desired class weights (roughly 90% majority class, 10% minority class).
- We create an `XGBClassifier` with specified hyperparameters.
- We instantiate a `StratifiedKFold` object, specifying the number of splits, whether to shuffle the data, and a random state for reproducibility (a quick check that each fold preserves the class balance follows this list).
- We use `cross_val_score` to perform stratified k-fold cross-validation, passing the model, the input features (`X`), the target variable (`y`), the `StratifiedKFold` object (`cv`), and the scoring metric (F1 score).
- We print the individual cross-validation scores along with their mean and standard deviation.
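To confirm that stratification is working as intended, you can inspect the class balance inside each fold yourself. The following sketch reuses the `X`, `y`, and `cv` objects defined above and prints the minority-class fraction in each training and test split (since `y` contains 0s and 1s, the mean of a slice of `y` is the fraction of minority-class samples in it):

# Inspect the class balance in each fold to verify stratification
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    train_minority = np.mean(y[train_idx])
    test_minority = np.mean(y[test_idx])
    print(f"Fold {fold}: train minority fraction={train_minority:.3f}, "
          f"test minority fraction={test_minority:.3f}")

With stratification, every fold should report a minority fraction close to the dataset-wide value of about 0.10.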
By using stratified k-fold cross-validation, we ensure that the model's performance is evaluated on a representative sample of the imbalanced dataset in each fold. This provides a more accurate estimate of the model's performance than regular k-fold cross-validation, which may produce folds whose class distributions differ significantly from the original dataset's.
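To see that difference in practice, you can repeat the same evaluation with scikit-learn's plain `KFold` splitter and compare the spread of the scores. This is a minimal sketch, reusing the model, data, and `cv_scores` from above (the `plain_cv` and `plain_scores` names are just illustrative); with a rarer minority class or a smaller dataset, the per-fold class proportions, and hence the scores, typically vary more without stratification:

from sklearn.model_selection import KFold

# Repeat the evaluation with a non-stratified splitter for comparison
plain_cv = KFold(n_splits=5, shuffle=True, random_state=42)
plain_scores = cross_val_score(model, X, y, cv=plain_cv, scoring='f1')
print(f"Stratified k-fold F1: {np.mean(cv_scores):.2f} +/- {np.std(cv_scores):.2f}")
print(f"Regular k-fold F1:    {np.mean(plain_scores):.2f} +/- {np.std(plain_scores):.2f}")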