XGBoost Comparing Models With Effect Size

When comparing the performance of different machine learning models, it’s essential to assess not only the statistical significance of the differences but also the magnitude of the effect.

Effect size is a quantitative measure of the difference between two groups, providing insights into the practical significance of the results.

In this example, we’ll compare the performance of XGBoost and Random Forest classifiers on a binary classification problem, calculate the effect size using Cohen’s d, and interpret the results to determine the practical significance of the difference between the two models.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from scipy import stats
import numpy as np

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_classes=2, n_features=20, n_informative=10, random_state=42)

# Define an XGBoost classifier and a Random Forest classifier
xgb_clf = xgb.XGBClassifier(n_estimators=100, random_state=42)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Set the number of cross-validation folds
cv_folds = 5

# Perform cross-validation for each model
xgb_scores = cross_val_score(xgb_clf, X, y, cv=cv_folds, scoring='accuracy')
rf_scores = cross_val_score(rf_clf, X, y, cv=cv_folds, scoring='accuracy')

# Calculate the mean and sample standard deviation (ddof=1) of the accuracy scores for each model
xgb_mean, xgb_std = np.mean(xgb_scores), np.std(xgb_scores, ddof=1)
rf_mean, rf_std = np.mean(rf_scores), np.std(rf_scores, ddof=1)

# Apply a Student's t-test to the performance samples to calculate a p-value
_, p_value = stats.ttest_ind(xgb_scores, rf_scores)

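# Calculate Cohen's d: mean difference divided by the pooled standard deviation
# (the square root of the average variance, valid because both samples have the same size)
effect_size = (xgb_mean - rf_mean) / np.sqrt((xgb_std**2 + rf_std**2) / 2)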

# Interpret the magnitude of the effect size using Cohen's benchmarks (0.2, 0.5, 0.8)
if abs(effect_size) < 0.2:
    interpretation = "negligible"
elif abs(effect_size) < 0.5:
    interpretation = "small"
elif abs(effect_size) < 0.8:
    interpretation = "medium"
else:
    interpretation = "large"

# Print out the results
print(f"XGBoost mean accuracy: {xgb_mean:.4f} ± {xgb_std:.4f}")
print(f"Random Forest mean accuracy: {rf_mean:.4f} ± {rf_std:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Effect size (Cohen's d): {effect_size:.4f} ({interpretation})")

In this example, we generate a synthetic binary classification dataset using make_classification from scikit-learn. We then define an XGBoost classifier (xgb_clf) and a Random Forest classifier (rf_clf), each with 100 trees and a fixed random_state; all other hyperparameters are left at their defaults.

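We perform k-fold cross-validation for each model using cross_val_score from scikit-learn, which returns an array containing one accuracy score per fold. With an integer cv and a classifier, scikit-learn uses stratified folds without shuffling, so both models are evaluated on the same five splits. We then calculate the mean and standard deviation of each model’s scores.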
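If shuffled folds are preferred, one way to keep the comparison fair is to pass the same seeded splitter to both cross_val_score calls. The sketch below is illustrative only: the two Random Forest variants (model_a, model_b) are hypothetical stand-ins for any pair of classifiers, and the smaller dataset is an assumption made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Small synthetic dataset (size chosen only to keep the sketch quick)
X, y = make_classification(n_samples=300, n_features=20, n_informative=10, random_state=42)

# A single splitter with a fixed seed produces the same folds every time it is used
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Hypothetical stand-in models; any two classifiers could be compared this way
model_a = RandomForestClassifier(n_estimators=50, random_state=0)
model_b = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)

# Both models are scored on identical train/test splits
scores_a = cross_val_score(model_a, X, y, cv=cv, scoring="accuracy")
scores_b = cross_val_score(model_b, X, y, cv=cv, scoring="accuracy")

# Because the folds match, the scores can be compared fold by fold
fold_differences = scores_a - scores_b
print("Per-fold accuracy differences:", fold_differences)
```

Matching folds mean that each entry of fold_differences compares the two models on exactly the same held-out data.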

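To assess the statistical significance of the difference in performance, we apply a Student’s t-test using ttest_ind from scipy.stats, which returns the t-statistic and the p-value. Note that ttest_ind treats the two sets of scores as independent samples. Since both models are scored on the same folds, a paired test such as ttest_rel from scipy.stats is often a better fit, and with only five fold scores per model the p-value should be read as a rough guide.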
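To illustrate how the choice of test can matter, the sketch below runs both an independent and a paired t-test on the same per-fold scores. The accuracy values are made-up numbers for illustration, not results from the models in this example, and the names model_a_scores and model_b_scores are hypothetical.

```python
import numpy as np
from scipy import stats

# Made-up per-fold accuracies for two models evaluated on the same five folds
model_a_scores = np.array([0.91, 0.89, 0.93, 0.90, 0.92])
model_b_scores = np.array([0.90, 0.88, 0.91, 0.89, 0.90])

# Independent-samples t-test: ignores that fold i is shared by both models
t_ind, p_ind = stats.ttest_ind(model_a_scores, model_b_scores)

# Paired t-test: works on the per-fold differences instead
t_rel, p_rel = stats.ttest_rel(model_a_scores, model_b_scores)

print(f"Independent t-test: t={t_ind:.3f}, p={p_ind:.4f}")
print(f"Paired t-test:      t={t_rel:.3f}, p={p_rel:.4f}")
```

When one model is consistently ahead by a small margin on every fold, the paired test tends to be more sensitive because fold-to-fold variation cancels out in the differences. Either p-value remains approximate, since the training sets of different folds overlap.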

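Next, we calculate the effect size using Cohen’s d: the difference between the two mean accuracies divided by the pooled standard deviation. Because both models have the same number of fold scores, the pooled standard deviation is the square root of the average of the two variances. We interpret the magnitude of d using Cohen’s conventional benchmarks: around 0.2 is small, 0.5 is medium, and 0.8 is large, with values below 0.2 usually treated as negligible.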
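The calculation and interpretation can also be wrapped in small reusable helpers. This is a minimal sketch, not part of any library: the function names cohens_d and interpret_d are hypothetical, the pooled standard deviation uses the general form that also handles unequal sample sizes, and the "negligible" label for values below 0.2 is a common convention rather than one of Cohen's original categories.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Sample variances (ddof=1), weighted by their degrees of freedom
    var_a = a.var(ddof=1)
    var_b = b.var(ddof=1)
    pooled_sd = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (a.mean() - b.mean()) / pooled_sd

def interpret_d(d):
    """Map the magnitude of d onto Cohen's conventional benchmarks."""
    magnitude = abs(d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"

# Toy inputs chosen only to exercise the helpers
d = cohens_d([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(f"d = {d:.4f} ({interpret_d(d)})")
```

With equal sample sizes the weighted pooled variance reduces to the simple average of the two variances, so the helper agrees with the formula described above. The sign of d shows which sample has the higher mean, and the interpretation uses only the magnitude.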

Finally, we print out the mean accuracy and standard deviation for each model, the p-value, the effect size, and its interpretation.

Reporting the effect size alongside the p-value gives a fuller picture of the performance difference between XGBoost and Random Forest: the p-value indicates whether the difference is likely to be more than noise, while Cohen’s d indicates whether it is large enough to matter when choosing a model.
