XGBoost Confidence Interval using Bootstrap and Standard Error

Confidence intervals for XGBoost performance metrics quantify the uncertainty in those estimates, which a single point estimate such as one accuracy score cannot convey.

We can estimate the confidence interval of model performance using the standard error computed from bootstrap replicates.

The standard error measures the variability of a sample mean, which in this case is the mean of the bootstrap accuracy replicates.

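This example demonstrates how to use the bootstrap to estimate a 95% confidence interval for the accuracy of an XGBoost model trained on a synthetic binary classification dataset. Each bootstrap replicate resamples the training set with replacement, refits the model, and scores it on a fixed held-out test set.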

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
import numpy as np

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=2, random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a function to compute bootstrap replicates of the accuracy metric
def bootstrap_accuracy(model, X, y, X_eval, y_eval, n_bootstraps=1000, seed=42):
    # Seeded generator so the resampling is reproducible
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_bootstraps):
        # Resample the training rows with replacement
        idx = rng.integers(0, len(X), size=len(X))
        # Refit on the resampled data and score on the fixed evaluation set
        model.fit(X[idx], y[idx])
        accuracies.append(model.score(X_eval, y_eval))
    return np.array(accuracies)

# Instantiate an XGBClassifier with default hyperparameters
model = XGBClassifier(random_state=42)

# Compute the bootstrap confidence interval for accuracy
# (this refits the model once per replicate, so it can take a while)
accuracies = bootstrap_accuracy(model, X_train, y_train, X_test, y_test)
se = accuracies.std() / np.sqrt(len(accuracies))
ci_low, ci_high = accuracies.mean() - 1.96 * se, accuracies.mean() + 1.96 * se

print(f"Mean Accuracy: {accuracies.mean():.3f}")
# Four decimals, because the standard error is usually too small to show at three
print(f"Standard Error: {se:.4f}")
print(f"95% CI: [{ci_low:.4f}, {ci_high:.4f}]")

The standard error is calculated as follows:

se = accuracies.std() / np.sqrt(len(accuracies))

Here, accuracies.std() computes the standard deviation of the bootstrap accuracy replicates, and np.sqrt(len(accuracies)) computes the square root of the number of replicates. Dividing the standard deviation by the square root of the number of replicates gives the standard error of the mean of the replicates.
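To make the formula concrete, the following minimal sketch applies it to made-up numbers rather than real model output. The helper name standard_error_of_mean and the simulated replicates (normal draws centred on 0.90 with a spread of 0.02) are assumptions for illustration only.

```python
import numpy as np

def standard_error_of_mean(replicates):
    """Standard deviation of the replicates divided by sqrt(number of replicates)."""
    replicates = np.asarray(replicates, dtype=float)
    return replicates.std() / np.sqrt(len(replicates))

# Hypothetical stand-ins for bootstrap accuracy replicates (not real model output)
rng = np.random.default_rng(0)
few = rng.normal(loc=0.90, scale=0.02, size=100)
many = rng.normal(loc=0.90, scale=0.02, size=10_000)

for name, reps in [("100 replicates", few), ("10,000 replicates", many)]:
    print(f"{name}: std = {reps.std():.4f}, se = {standard_error_of_mean(reps):.5f}")
```

In this sketch the standard deviation stays close to 0.02 for both arrays, while the standard error shrinks as the number of replicates grows. The standard error therefore describes how precisely the mean of the replicates is known, not how widely individual replicates vary.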

Assuming the sample mean is normally distributed, we can use the standard error to compute the confidence interval bounds. For a 95% confidence interval, we use a z-score of 1.96:

ci_low, ci_high = accuracies.mean() - 1.96 * se, accuracies.mean() + 1.96 * se

The lower bound of the confidence interval is obtained by subtracting 1.96 times the standard error from the mean accuracy, while the upper bound is obtained by adding 1.96 times the standard error to the mean accuracy.
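The value 1.96 is the two-sided 95% critical value of the standard normal distribution. As a sketch of how the same calculation could be generalized to other confidence levels, the hypothetical helper normal_ci below looks up the z-score with Python's standard-library statistics.NormalDist. The replicate values are invented for illustration.

```python
from statistics import NormalDist

import numpy as np

def normal_ci(replicates, confidence=0.95):
    """Normal-approximation interval: mean +/- z * standard error of the mean."""
    replicates = np.asarray(replicates, dtype=float)
    mean = replicates.mean()
    se = replicates.std() / np.sqrt(len(replicates))
    # Two-sided critical value, e.g. about 1.96 for confidence=0.95
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * se, mean + z * se

# Made-up accuracy replicates, used only to exercise the helper
example_replicates = [0.88, 0.90, 0.91, 0.89, 0.92]

for level in (0.90, 0.95, 0.99):
    low, high = normal_ci(example_replicates, confidence=level)
    print(f"{level:.0%} CI: [{low:.4f}, {high:.4f}]")
```

Higher confidence levels use a larger z-score and so produce wider intervals around the same mean.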

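Because the standard error is derived from the spread of the bootstrap accuracy replicates, the interval reflects their variability. Note that it describes how precisely the mean of the replicates is estimated, so it narrows as the number of bootstrap replicates grows. The spread of accuracy across resampled training sets is given by accuracies.std() itself.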

This approach is most appropriate when the distribution of the performance metric is approximately normal, because a confidence interval based on the standard error assumes a normally distributed sample mean.
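When the replicates are clearly skewed, a commonly used alternative (not part of the example above) is the percentile interval, which reads the bounds directly from the quantiles of the replicates instead of relying on a normal approximation. The sketch below assumes a hypothetical helper percentile_interval and uses simulated, left-skewed values in place of real accuracy replicates.

```python
import numpy as np

def percentile_interval(replicates, confidence=0.95):
    """Bounds taken from the empirical quantiles of the replicates."""
    alpha = 1.0 - confidence
    low, high = np.percentile(replicates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(low), float(high)

# Simulated, left-skewed stand-ins for accuracy replicates (not real model output)
rng = np.random.default_rng(1)
skewed_replicates = rng.beta(50, 3, size=2000)

low, high = percentile_interval(skewed_replicates)
print(f"Mean of replicates: {skewed_replicates.mean():.3f}")
print(f"95% percentile interval: [{low:.3f}, {high:.3f}]")
```

This interval summarizes the spread of the replicates themselves, so it answers a different question from the standard-error interval and is typically wider.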
