XGBoost provides a built-in function for performing k-fold cross-validation, which can simplify your code and potentially speed up the evaluation process compared to using an external library like scikit-learn. The cv() function in XGBoost's native API makes it easy to perform cross-validation with just a few lines of code.
from sklearn.datasets import fetch_california_housing
import xgboost as xgb
# Load the California Housing dataset
X, y = fetch_california_housing(return_X_y=True)
# Create a DMatrix object from the data
data = xgb.DMatrix(X, label=y)
# Specify the XGBoost parameters
params = {
    'objective': 'reg:squarederror',
    'learning_rate': 0.1,
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'seed': 42
}
# Perform k-fold cross-validation
cv_results = xgb.cv(
    params,
    data,
    num_boost_round=100,
    nfold=5,
    metrics='rmse',
    as_pandas=True,
    seed=42
)
# Print the cross-validation results
print(cv_results)
print(f"Best RMSE: {cv_results['test-rmse-mean'].min():.2f} at iteration {cv_results['test-rmse-mean'].idxmin()}")
Here’s what’s happening:
- We load the California Housing dataset and create a DMatrix object from the data, which is the data structure used by XGBoost’s native API.
- We specify the XGBoost parameters in a dictionary, including the objective function, learning rate, maximum tree depth, the subsampling ratios (subsample and colsample_bytree), and the random seed.
- We use xgb.cv() to perform 5-fold cross-validation, specifying the parameters, training data, number of boosting rounds, evaluation metric (RMSE), and other settings.
- We print the cross-validation results, which include the mean and standard deviation of the evaluation metric across the folds at each boosting iteration.
- Finally, we print the best RMSE score and the corresponding iteration.
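Once the best iteration is known, a natural next step is to train a final model with that many boosting rounds. The snippet below is a minimal sketch, assuming the code above has already run so that params, data, and cv_results are still in scope:
# Sketch: use the best cross-validated iteration to set the number of boosting rounds
best_round = cv_results['test-rmse-mean'].idxmin() + 1  # idxmin() is zero-based
final_model = xgb.train(params, data, num_boost_round=best_round)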
By using XGBoost’s native API for cross-validation, you can take advantage of its optimized implementation and keep your code concise and focused on the XGBoost-specific configuration.
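For reference, a roughly equivalent evaluation with scikit-learn would use the XGBRegressor wrapper together with cross_val_score. This is only a sketch for comparison; the hyperparameters mirror the native ones used above:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
# Sketch of the scikit-learn route, shown for comparison with the native API above
X, y = fetch_california_housing(return_X_y=True)
model = XGBRegressor(
    objective='reg:squarederror',
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)
# scikit-learn maximizes scores, so RMSE is reported as a negative value
scores = cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error')
print(f"Mean RMSE across folds: {-scores.mean():.2f}")
Note that cross_val_score returns only one score per fold for the fully trained model, whereas xgb.cv reports the metric at every boosting round, which makes it easier to spot the best iteration.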