K-fold cross-validation gives a more robust estimate of your XGBoost model’s performance than a single train/test split, because it trains and evaluates the model on multiple subsets of your data. Scikit-learn’s cross_val_score function makes it easy to perform k-fold cross-validation with just a few lines of code.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
# Load the diabetes dataset
X, y = load_diabetes(return_X_y=True)
# Create an XGBRegressor
model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
# Perform 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
# Convert negative MSE scores to positive RMSE scores
rmse_scores = np.sqrt(-cv_scores)
# Print the per-fold RMSE scores and their mean
print("Cross-validation RMSE scores:", rmse_scores)
print(f"Mean cross-validation RMSE: {np.mean(rmse_scores):.2f}")
Here’s what’s happening:
- We load the diabetes dataset and create an XGBRegressor with specified hyperparameters.
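- We use cross_val_score() to perform 5-fold cross-validation, specifying the model, input features (X), target variable (y), number of folds (cv), and the scoring metric (negative mean squared error). Scikit-learn negates MSE so that higher scores are always better.
- We flip the sign and take the square root to convert the negative MSE scores to RMSE scores, which are in the same units as the target and easier to interpret.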
- We print the individual cross-validation scores and their mean.
K-fold cross-validation helps you detect overfitting by checking whether your model performs well across different subsets of your data. It’s especially useful when you have limited data and can’t afford a separate validation set. By using scikit-learn’s built-in cross_val_score, you can easily incorporate this technique into your model evaluation workflow.