Repeated k-fold cross-validation takes the robustness of k-fold cross-validation a step further by repeating the process multiple times, providing an even more reliable estimate of your XGBoost model’s performance. Scikit-learn’s RepeatedKFold class makes it easy to implement this powerful technique.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score, RepeatedKFold
from xgboost import XGBRegressor
# Load the diabetes dataset
X, y = load_diabetes(return_X_y=True)
# Create an XGBRegressor
model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
# Create a RepeatedKFold object
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
# Perform repeated k-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
# Convert negative MSE scores to positive RMSE scores
rmse_scores = np.sqrt(-cv_scores)
# Print the cross-validation scores
print("Cross-validation scores:", rmse_scores)
print(f"Mean cross-validation score: {np.mean(rmse_scores):.2f}")
Here’s what’s happening:
- We load the diabetes dataset and create an XGBRegressor with specified hyperparameters.
- We create a RepeatedKFold object, specifying the number of splits (5) and the number of times to repeat the process (3).
- We use cross_val_score() to perform repeated k-fold cross-validation, specifying the model, input features (X), target variable (y), the RepeatedKFold object (cv), and the scoring metric (negative mean squared error).
- We convert the negative MSE scores to RMSE scores for easier interpretation.
- We print the individual cross-validation scores and their mean.
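Because RepeatedKFold yields all the folds of one repeat before moving on to the next, the 15 scores come back grouped by repeat. As a quick sketch (reusing rmse_scores from the example above, and assuming that split ordering), you can reshape the scores to see how much the average RMSE drifts from one repeat to the next:
# Group the 15 RMSE scores by repeat: folds 1-5 of repeat 1, then repeat 2, and so on
per_repeat = rmse_scores.reshape(3, 5)  # 3 repeats x 5 folds
for i, repeat_scores in enumerate(per_repeat, start=1):
    print(f"Repeat {i}: mean RMSE = {repeat_scores.mean():.2f}")
If the per-repeat means sit close together, that is a good sign the estimate is stable.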
By repeating the k-fold cross-validation process, we obtain a more stable and reliable estimate of our model’s performance. This helps ensure that our model’s performance is consistent across different subsets of the data and isn’t unduly influenced by a particularly favorable or unfavorable split.
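To see that effect concretely, one option (a minimal sketch reusing the model, data, and rmse_scores from above) is to compare the spread of RMSE scores from a single shuffled 5-fold run against the repeated run:
from sklearn.model_selection import KFold
# Single 5-fold run for comparison
single_cv = KFold(n_splits=5, shuffle=True, random_state=42)
single_scores = np.sqrt(-cross_val_score(model, X, y, cv=single_cv, scoring='neg_mean_squared_error'))
print(f"Single 5-fold RMSE:   mean {single_scores.mean():.2f}, std {single_scores.std():.2f} ({len(single_scores)} scores)")
print(f"Repeated 5-fold RMSE: mean {rmse_scores.mean():.2f}, std {rmse_scores.std():.2f} ({len(rmse_scores)} scores)")
The repeated run averages over three times as many folds, so its mean is less sensitive to any single lucky or unlucky split.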