XGBoost "sample_weight" to Bias Training Toward Recent Examples (Data Drift)

Data drift, a change over time in the statistical properties of the data a model sees (the input features, the target, or the relationship between them), can degrade the performance of machine learning models if not addressed.

One approach to mitigate the impact of data drift is to bias the model towards more recent examples during training.

In this example, we’ll demonstrate how to use the sample_weight parameter in XGBoost to assign higher weights to more recent training instances, effectively biasing the model to adapt to the changing data distribution.

We’ll generate a synthetic dataset that exhibits data drift, train two XGBClassifier models (one with and one without sample_weight biasing), and compare their performance.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import numpy as np

# Generate a synthetic dataset with data drift
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)

# Introduce data drift: treat row order as time (oldest first) and flip every label in the oldest 30% of rows
drift_start_idx = 0
num_flipped = int(0.3 * (len(X) - drift_start_idx))
y[drift_start_idx:drift_start_idx+num_flipped] = 1 - y[drift_start_idx:drift_start_idx+num_flipped]

# Split chronologically: shuffle=False keeps row order, so the newest 20% of rows become the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Create sample_weight array that grows exponentially from 1 (oldest row) to e^2 (newest row)
sample_weight = np.exp(np.linspace(0, 2, len(y_train)))

# Train XGBClassifier without sample_weight
model_unweighted = XGBClassifier(random_state=42)
model_unweighted.fit(X_train, y_train)

# Train XGBClassifier with sample_weight
model_weighted = XGBClassifier(random_state=42)
model_weighted.fit(X_train, y_train, sample_weight=sample_weight)

# Generate predictions
y_pred_unweighted = model_unweighted.predict(X_test)
y_pred_weighted = model_weighted.predict(X_test)

# Compare model performance
print("Unweighted Model Accuracy:", accuracy_score(y_test, y_pred_unweighted))
print("Weighted Model Accuracy:", accuracy_score(y_test, y_pred_weighted))

print("\nUnweighted Model Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_unweighted))
print("\nWeighted Model Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_weighted))

print("\nUnweighted Model Classification Report:")
print(classification_report(y_test, y_pred_unweighted))
print("\nWeighted Model Classification Report:")
print(classification_report(y_test, y_pred_weighted))

In this example, we generate a synthetic binary classification dataset using make_classification from scikit-learn. To simulate data drift, we treat row order as time and flip the labels of the first 30% of the instances. This means that the later examples are more “correct” than the earlier ones.
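
The label flip described above can be checked in isolation. The following is a minimal sketch that uses random stand-in labels instead of make_classification (the names y_clean, y_drift, and changed are illustrative, not part of the original example) and confirms that only the oldest 30% of rows are affected:

```python
import numpy as np

# Stand-in labels (an assumption for illustration; the article uses make_classification)
rng = np.random.default_rng(42)
y_clean = rng.integers(0, 2, size=1000)

# Flip every label in the oldest 30% of rows, as in the example above
y_drift = y_clean.copy()
num_flipped = int(0.3 * len(y_drift))
y_drift[:num_flipped] = 1 - y_drift[:num_flipped]

# Fraction of labels that changed in the old and the recent segment
changed = y_drift != y_clean
print("Changed in oldest 30%:", changed[:num_flipped].mean())
print("Changed in newest 70%:", changed[num_flipped:].mean())
```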

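We then create a sample_weight array whose values grow exponentially with row position (equivalently, decay exponentially with age), from 1 for the first training row to e^2 ≈ 7.4 for the last. Because the weights are assigned by position, the training rows must be in chronological order for the last rows to be the most recent ones. This biases the model to focus more on the newer data points during training.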
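
It can be easier to reason about the weighting in terms of a half-life, the number of rows after which a weight halves. The sketch below is one possible parameterization, not part of the original example; recency_weights and half_life are hypothetical names, and the rows are assumed to be ordered oldest to newest:

```python
import numpy as np

def recency_weights(n_samples, half_life):
    """Weights for rows ordered oldest to newest; the newest row gets weight 1."""
    age = np.arange(n_samples - 1, -1, -1)  # newest row has age 0
    return 0.5 ** (age / half_life)

# A row that is 200 positions older than the newest one gets half its weight
weights = recency_weights(n_samples=800, half_life=200)
print("Newest row weight:", weights[-1])
print("Weight 200 rows back:", weights[-201])
```

For 800 training rows, np.exp(np.linspace(0, 2, 800)) has the same shape as a half-life of about 277 rows, up to a constant factor of e^2.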

We train two XGBClassifier models: one without sample_weight and one with sample_weight. After generating predictions on the test set, we compare the performance of the two models using accuracy, confusion matrix, and classification report.
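
When drift is the concern, the comparison is most informative if the test set comes from the most recent period, since that is the data the model will face next. A minimal sketch of such a split, assuming rows are ordered oldest to newest (chronological_split is a hypothetical helper, and the demo arrays are made up for illustration):

```python
import numpy as np

def chronological_split(X, y, test_size=0.2):
    """Split rows ordered oldest to newest; the newest rows form the test set."""
    n_test = int(round(len(X) * test_size))
    split = len(X) - n_test
    return X[:split], X[split:], y[:split], y[split:]

# Tiny demo: 10 rows, so the last 2 rows are held out
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2
X_tr, X_te, y_tr, y_te = chronological_split(X_demo, y_demo, test_size=0.2)
print(len(X_tr), len(X_te))
```

Calling scikit-learn's train_test_split with shuffle=False produces the same kind of split.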

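If the model trained with sample_weight biasing adapts better to the data drift, this shows up as higher accuracy and better per-class metrics than the unweighted model on the test set. The size of the gap depends on the data, the strength of the drift, and how steeply the weights favor recent examples, so check the printed metrics rather than assuming an improvement.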
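
How steeply the weights favor recent rows is a trade-off: steeper weights track recent data more closely but leave the model with less effective training data. One common way to quantify this is Kish's effective sample size, (sum of weights)^2 / (sum of squared weights). This is a rough diagnostic sketch, not something XGBoost reports, and effective_sample_size is a hypothetical helper:

```python
import numpy as np

def effective_sample_size(weights):
    """Kish's effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

n = 800
ess_uniform = effective_sample_size(np.ones(n))                     # no recency bias
ess_article = effective_sample_size(np.exp(np.linspace(0, 2, n)))   # weights used above
ess_steeper = effective_sample_size(np.exp(np.linspace(0, 6, n)))   # stronger recency bias

print("Uniform weights:", ess_uniform)
print("exp(linspace(0, 2)):", ess_article)
print("exp(linspace(0, 6)):", ess_steeper)
```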

By leveraging the sample_weight parameter in XGBoost, you can reduce the impact of data drift and help maintain model performance when the data distribution changes over time.
