When comparing the performance of two XGBoost model configurations, it’s important to determine if the observed difference is statistically significant or just due to random chance.
This example demonstrates how to use cross-validation or bootstrap sampling to generate performance estimates for each configuration and then apply a statistical significance test, such as Student’s t-test, to calculate a p-value.
By comparing the p-value to a configured significance level (alpha), we can conclude whether the performance difference is statistically significant.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
import xgboost as xgb
from scipy import stats
# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_classes=2, n_features=20, n_informative=10, random_state=42)
# Define two XGBoost model configurations to compare
config1 = {'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.1}
config2 = {'n_estimators': 200, 'max_depth': 5, 'learning_rate': 0.05}
# Set the number of cross-validation folds and the significance level (alpha)
cv_folds = 5
alpha = 0.05
# Perform cross-validation for each configuration
scores1 = cross_val_score(xgb.XGBClassifier(**config1), X, y, cv=cv_folds, scoring='accuracy')
scores2 = cross_val_score(xgb.XGBClassifier(**config2), X, y, cv=cv_folds, scoring='accuracy')
# Calculate the mean and standard deviation of the performance metric for each configuration
mean1, std1 = np.mean(scores1), np.std(scores1)
mean2, std2 = np.mean(scores2), np.std(scores2)
# Apply a Student's t-test to the performance samples to calculate a p-value
_, p_value = stats.ttest_ind(scores1, scores2)
# Compare the p-value to the configured significance level alpha to determine statistical significance
is_significant = p_value < alpha
# Print out the mean performance of each configuration, the p-value, and whether the difference is statistically significant
print(f"Config 1 mean accuracy: {mean1:.4f} ± {std1:.4f}")
print(f"Config 2 mean accuracy: {mean2:.4f} ± {std2:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"The difference is {'statistically significant' if is_significant else 'not statistically significant'} at alpha = {alpha}")
In this example, we generate a synthetic binary classification dataset using make_classification from scikit-learn. We define two XGBoost model configurations (config1 and config2) with different hyperparameter settings.
We set the number of cross-validation folds (cv_folds) and the significance level (alpha) for the statistical test. We then perform cross-validation for each configuration using cross_val_score from scikit-learn, which returns an array of accuracy scores, one per fold.
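When cv is passed as an integer and the estimator is a classifier, scikit-learn uses a deterministic StratifiedKFold, so both configurations are scored on the same folds. If you want to make that explicit, or shuffle the data before splitting, you can pass a splitter object yourself. A small sketch, where the random_state value is an arbitrary choice:
from sklearn.model_selection import StratifiedKFold
# Explicit splitter so both configurations are evaluated on identical folds
splitter = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=42)
scores1 = cross_val_score(xgb.XGBClassifier(**config1), X, y, cv=splitter, scoring='accuracy')
scores2 = cross_val_score(xgb.XGBClassifier(**config2), X, y, cv=splitter, scoring='accuracy')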
We calculate the mean and standard deviation of the accuracy scores for each configuration. To determine whether the difference in performance is statistically significant, we apply a Student's t-test using ttest_ind from scipy.stats, which returns the t-statistic and the p-value.
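As an aside, because the two score arrays come from the same folds, they can also be treated as paired samples, and a paired t-test is a common alternative to the independent-samples test used in this example. A minimal sketch using scipy's ttest_rel:
# Paired t-test: each fold contributes one paired observation (score1, score2)
_, p_value_paired = stats.ttest_rel(scores1, scores2)
print(f"Paired t-test p-value: {p_value_paired:.4f}")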
Finally, we compare the p-value to the configured significance level alpha to decide whether the difference is statistically significant. We print the mean accuracy and standard deviation for each configuration, the p-value, and whether the difference is statistically significant at the chosen alpha.
By using this approach, we can make informed decisions about which XGBoost configuration to choose based on their performance and the statistical significance of the difference between them.
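The same significance test can also be applied to bootstrap estimates instead of cross-validation scores. The sketch below is one possible setup rather than part of the example above: it holds out a test set, refits each configuration on bootstrap resamples of the training data (the n_bootstrap value is an arbitrary choice), and compares the resulting accuracy samples with the same t-test.
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.metrics import accuracy_score
# Hold out a fixed test set for evaluating every bootstrap fit
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
n_bootstrap = 30  # number of bootstrap resamples (arbitrary choice)
boot_scores1, boot_scores2 = [], []
for i in range(n_bootstrap):
    # Resample the training data with replacement
    X_boot, y_boot = resample(X_train, y_train, random_state=i)
    for config, scores in ((config1, boot_scores1), (config2, boot_scores2)):
        model = xgb.XGBClassifier(**config)
        model.fit(X_boot, y_boot)
        scores.append(accuracy_score(y_test, model.predict(X_test)))
# Compare the two sets of bootstrap accuracy estimates
_, p_value_boot = stats.ttest_ind(boot_scores1, boot_scores2)
print(f"Bootstrap p-value: {p_value_boot:.4f}")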