XGBoost Robust to Mislabeled Data (label noise)

XGBoost tends to tolerate a moderate amount of label noise, which makes it a reasonable choice when working with datasets that may contain mislabeled instances.

Even if a small portion of the training data has incorrect labels, XGBoost can often still perform well and generalize to unseen data, although accuracy usually degrades as the share of wrong labels grows.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
import numpy as np

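# Generate a clean synthetic dataset (flip_y=0 disables make_classification's default label noise)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8, n_redundant=2,
                           n_classes=2, weights=[0.8, 0.2], flip_y=0, random_state=42)

# Introduce label noise: flip each label with probability 0.05 (seeded for reproducibility)
rng = np.random.default_rng(42)
noise_mask = rng.random(y.shape[0]) < 0.05
y_noisy = y.copy()
y_noisy[noise_mask] = 1 - y_noisy[noise_mask]

# Split data into train and test sets: noisy labels for training, clean labels for testing
X_train, X_test, y_train, _, _, y_test = train_test_split(
    X, y_noisy, y, test_size=0.2, random_state=42)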

# Initialize and train XGBoost model
model = XGBClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions on test set
y_pred = model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.4f}")

Here’s a step-by-step breakdown:

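Generate a synthetic binary classification dataset using make_classification from scikit-learn. This serves as the clean dataset; note that make_classification randomly reassigns the class of about 1% of samples by default (flip_y=0.01), so pass flip_y=0 for a baseline with no label noise at all.
Introduce label noise by flipping each label independently with probability 0.05, so that roughly 5% of the labels end up wrong. This simulates a scenario where a small portion of the training data has been mislabeled.
Split the data into training and test sets. The model should train on the noisy labels while the test set keeps the original, clean labels, so that the score reflects performance on correctly labeled, unseen data.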
Initialize an XGBoost classifier with a fixed random state for reproducibility.
Train the XGBoost model on the noisy training data. With only a small fraction of mislabeled instances, the model can usually still pick up the dominant patterns and relationships in the data.
Use the trained model to make predictions on the clean test set.
Evaluate the model’s performance by calculating the accuracy score. This metric will give you an idea of how well the model generalizes to unseen data, even when trained on noisy labels.
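
The noise-injection step can be isolated into a small helper. The sketch below is one possible variant, not the original example's method: instead of flipping each label with a fixed probability, it flips an exact number of labels, which makes the noise rate the same on every run. The function name flip_binary_labels and its parameters are hypothetical and assume binary labels encoded as 0 and 1.

```python
import numpy as np


def flip_binary_labels(y, noise_rate, seed=0):
    """Return a copy of y with round(noise_rate * len(y)) labels flipped.

    Hypothetical helper: assumes y holds binary labels encoded as 0 and 1.
    Also returns the indices that were flipped.
    """
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    n_flip = int(round(noise_rate * y.shape[0]))
    # Sample distinct positions so no label is flipped twice
    flipped_idx = rng.choice(y.shape[0], size=n_flip, replace=False)
    y_noisy = y.copy()
    y_noisy[flipped_idx] = 1 - y_noisy[flipped_idx]
    return y_noisy, flipped_idx


y_clean = np.array([0, 1] * 50)  # 100 toy labels
y_noisy, flipped_idx = flip_binary_labels(y_clean, noise_rate=0.05, seed=42)
print((y_noisy != y_clean).sum())  # 5
```

Returning a copy keeps the clean labels available, which is what allows the test set to be scored against correct labels later.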

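The output shows the test accuracy. With only 5% label noise it is typically close to what the same model reaches when trained on clean labels. A single number does not establish robustness on its own, though: because the classes are split roughly 80/20, a classifier that always predicts the majority class already scores about 0.80, so the result is best read against that baseline and against a model trained on clean labels.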

See Also