XGBoost Evaluate Model using k-Fold Cross-Validation

K-fold cross-validation gives a more robust estimate of your XGBoost model’s performance than a single train/test split. The data is divided into k folds; the model is trained on k-1 of them and scored on the held-out fold, and this is repeated until every fold has served once as the test set. Scikit-learn’s cross_val_score function performs the whole procedure in just a few lines of code.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Load the diabetes dataset
X, y = load_diabetes(return_X_y=True)

# Create an XGBRegressor
model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Perform 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')

# Convert negative MSE scores to positive RMSE scores
rmse_scores = np.sqrt(-cv_scores)

# Print the per-fold RMSE scores and their mean
print("Cross-validation RMSE scores:", rmse_scores)
print(f"Mean cross-validation RMSE: {np.mean(rmse_scores):.2f}")

Here’s what’s happening:

  1. We load the diabetes dataset and create an XGBRegressor with specified hyperparameters.
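  2. We use cross_val_score() to perform 5-fold cross-validation, specifying the model, input features (X), target variable (y), number of folds (cv), and the scoring metric. The metric is negative mean squared error because scikit-learn scorers follow a "higher is better" convention, so error metrics are returned negated.
  3. We flip the sign of the negative MSE scores and take the square root to get RMSE scores, which are in the same units as the target and easier to interpret.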
  4. We print the individual cross-validation scores and their mean.

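K-fold cross-validation does not prevent overfitting by itself, but it helps you detect it: because every score comes from data the model did not see during training, a model that only memorizes its training folds will score poorly on the held-out folds. It’s especially useful when you have limited data and can’t afford to set aside a separate validation set, since every sample is used for both training and evaluation. With scikit-learn’s built-in cross_val_score, you can add this technique to your model evaluation workflow with very little code.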

