
XGBoost: Configure xgboost.cv() Parameters

The xgboost.cv() function is a powerful tool for performing cross-validation with XGBoost models.

It allows you to evaluate model performance, tune hyperparameters, and select the best model configuration.

Properly setting the parameters of xgboost.cv() is crucial for obtaining reliable and informative results.

from sklearn.datasets import make_classification
import xgboost as xgb

# Generate a synthetic multi-class classification dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=2, random_state=42)

# Set up parameters for xgboost.cv()
params = {
    'objective': 'multi:softprob',  # Specify multiclass classification
    'num_class': 3,                 # Number of classes
    'max_depth': 3,
    'eta': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8
}

# Perform cross-validation
cv_results = xgb.cv(
    dtrain=xgb.DMatrix(X, label=y),
    params=params,
    nfold=5,                       # Number of cross-validation folds
    num_boost_round=100,           # Maximum number of boosting rounds
    metrics=['mlogloss', 'merror'], # Evaluation metrics
    early_stopping_rounds=10,      # Stop if the last metric (merror) hasn't improved in 10 rounds
    seed=42,                       # Random seed for reproducibility
    shuffle=True,                  # Shuffle data before splitting
    as_pandas=True,                # Return results as a pandas DataFrame
    verbose_eval=10                # Print evaluation metric every 10 rounds
)

# Access and print the cross-validation results
print(cv_results)
print(f"Best test mlogloss: {cv_results['test-mlogloss-mean'].min():.4f} "
      f"(std = {cv_results['test-mlogloss-std'].min():.4f})")
print(f"Best test merror: {cv_results['test-merror-mean'].min():.4f} "
      f"(std = {cv_results['test-merror-std'].min():.4f})")
print(f"Best iteration: {cv_results['test-mlogloss-mean'].idxmin()}")

The most important parameters in xgboost.cv() include:

- params: the dictionary of model hyperparameters to evaluate (objective, num_class, max_depth, eta, and so on)
- dtrain: the training data as a DMatrix
- nfold: the number of cross-validation folds
- num_boost_round: the maximum number of boosting rounds
- metrics: the evaluation metric or metrics to track on each fold
- early_stopping_rounds: the number of rounds without improvement after which boosting stops

Other useful parameters include seed for setting a random seed to ensure reproducibility, shuffle for shuffling the data before splitting into folds, and verbose_eval for controlling how often the evaluation metrics are printed during training.
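For instance, here is a minimal variation on the call above (a sketch reusing the same X, y, and params) that fixes the fold assignment with seed and silences the per-round log with verbose_eval=False:

# Reuses X, y, and params from the example above
cv_quiet = xgb.cv(
    params=params,
    dtrain=xgb.DMatrix(X, label=y),
    nfold=5,
    num_boost_round=100,
    metrics=['mlogloss'],
    seed=42,             # same seed -> identical fold splits across runs
    shuffle=True,        # shuffle rows before assigning them to folds
    verbose_eval=False   # suppress the per-round progress output
)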

Note that model hyperparameters like max_depth, eta, subsample, and colsample_bytree are passed via the params argument, not directly to xgboost.cv().
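As a quick illustration (a sketch; deeper_params is a hypothetical name), trying a second configuration means building another dictionary and passing it as params, while the xgboost.cv() arguments themselves stay the same:

# Hypothetical second configuration: only the params dict changes;
# the cross-validation arguments (nfold, num_boost_round, ...) do not
deeper_params = {**params, 'max_depth': 6, 'eta': 0.05}
cv_deeper = xgb.cv(
    params=deeper_params,
    dtrain=xgb.DMatrix(X, label=y),
    nfold=5,
    num_boost_round=100,
    metrics=['mlogloss'],
    early_stopping_rounds=10,
    seed=42
)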

The cross-validation results are returned as a pandas DataFrame by default (as_pandas=True), allowing easy access to the evaluation metrics for each fold and iteration. You can extract the best test scores and corresponding iteration using the DataFrame’s built-in methods.
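A common next step (a sketch, not part of the original example) is to retrain on the full dataset for the best round count found by cross-validation; idxmin() is zero-based, so one is added to get the number of rounds:

# Retrain on all data for the number of rounds that minimized test mlogloss
best_rounds = int(cv_results['test-mlogloss-mean'].idxmin()) + 1
final_model = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=best_rounds)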

By carefully configuring the parameters of xgboost.cv() and leveraging the cross-validation results, you can gain valuable insights into your model’s performance, compare different hyperparameter settings, and select the best configuration for your specific problem and dataset. Experiment with different parameter values and monitor the evaluation metrics to find the optimal setup for your XGBoost model.
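To make that concrete, here is one way such a comparison might look (a sketch; the eta grid is arbitrary): run xgboost.cv() once per candidate value and keep the setting with the lowest mean test mlogloss.

# Compare a few learning rates by their best mean test mlogloss (illustrative grid)
dtrain = xgb.DMatrix(X, label=y)
scores = {}
for eta in [0.05, 0.1, 0.3]:
    trial_params = {**params, 'eta': eta}
    cv = xgb.cv(params=trial_params, dtrain=dtrain, nfold=5,
                num_boost_round=100, metrics=['mlogloss'],
                early_stopping_rounds=10, seed=42, verbose_eval=False)
    scores[eta] = cv['test-mlogloss-mean'].min()

best_eta = min(scores, key=scores.get)
print(f"Best eta: {best_eta} (mlogloss = {scores[best_eta]:.4f})")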



See Also