XGBoost Configure "sample_weight" Parameter for Imbalanced Classification

When dealing with imbalanced classification tasks, where the number of instances in each class is significantly different, XGBoost offers two main approaches to handle class imbalance: sample_weight and scale_pos_weight.

Generally, sample_weight is an array passed to fit() that weights the importance of individual instances (rows) in the training data, whereas scale_pos_weight is a model parameter that weights the importance of the positive class as a whole.

The sample_weight parameter can be made to function much like the scale_pos_weight parameter, if desired, by giving every instance of a class the same weight.

This example demonstrates how to use both techniques and compares their performance using evaluation metrics.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np

# Generate an imbalanced synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compute class weights for sample_weight
class_weights = {0: 1, 1: 10}
sample_weights = np.array([class_weights[class_id] for class_id in y_train])

# Compute positive class weight for scale_pos_weight (about 8.6)
scale_pos_weight = len(y_train[y_train == 0]) / len(y_train[y_train == 1])

# Initialize XGBClassifier with scale_pos_weight
model_spw = XGBClassifier(n_estimators=100, scale_pos_weight=scale_pos_weight, random_state=42)

# Initialize a default XGBClassifier; sample_weight is supplied later in fit()
model_sw = XGBClassifier(n_estimators=100, random_state=42)

# Fit the models
model_spw.fit(X_train, y_train)
model_sw.fit(X_train, y_train, sample_weight=sample_weights)

# Generate predictions
pred_spw = model_spw.predict(X_test)
pred_sw = model_sw.predict(X_test)

# Evaluate the models
print("Model with scale_pos_weight:")
print("Confusion Matrix:")
print(confusion_matrix(y_test, pred_spw))
print("\nClassification Report:")
print(classification_report(y_test, pred_spw))

print("\nModel with sample_weight:")
print("Confusion Matrix:")
print(confusion_matrix(y_test, pred_sw))
print("\nClassification Report:")
print(classification_report(y_test, pred_sw))

In this example, we generate a synthetic imbalanced dataset using make_classification from scikit-learn. We then initialize two XGBClassifier models: one with scale_pos_weight set to the observed class ratio, and a default model that receives fixed per-class weights through the sample_weight argument of fit().

For the scale_pos_weight approach, we set the parameter value to the ratio of the number of negative instances to the number of positive instances (calculated to be about 8.6 in this case).
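To see why this ratio is a common choice, the arithmetic can be sketched on a tiny hand-made label array. The toy labels below are an assumption for illustration and are not the article's dataset; the point is only that scaling positives by the ratio makes the two classes carry equal total weight.

```python
import numpy as np

# Toy labels (illustrative only): 9 negatives and 1 positive
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# Count instances per class; index 0 = negatives, index 1 = positives
neg_count, pos_count = np.bincount(y, minlength=2)

# Same negative-to-positive ratio used for scale_pos_weight
ratio = neg_count / pos_count

# Total weight carried by each class once positives are scaled by the ratio
total_neg_weight = neg_count * 1.0
total_pos_weight = pos_count * ratio

print("ratio:", ratio)
print("total weights:", total_neg_weight, total_pos_weight)
```

On real data the ratio is rarely a round number, which is why the article reports it as approximately 8.6.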

For the sample_weight approach, we create a dictionary class_weights that assigns a weight of 1 to the majority class (0) and a weight of 10 to the minority class (1). We then create an array sample_weights by mapping the class weights to the corresponding instances in the training set.
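The weight of 10 above is a hand-picked value. As a rough sketch, per-row weights could instead be derived from the labels themselves, either by reusing the negative-to-positive ratio or with scikit-learn's compute_sample_weight helper. The toy label array and variable names below are assumptions for illustration only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy training labels (illustrative only): 8 negatives and 2 positives
y_train_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Option 1: mirror scale_pos_weight by weighting positives with the class ratio
ratio = np.sum(y_train_toy == 0) / np.sum(y_train_toy == 1)
weights_ratio = np.where(y_train_toy == 1, ratio, 1.0)

# Option 2: scikit-learn's "balanced" heuristic,
# n_samples / (n_classes * count_of_that_class) for each row's class
weights_balanced = compute_sample_weight(class_weight="balanced", y=y_train_toy)

print(weights_ratio)
print(weights_balanced)
```

Both arrays give positives the same weight relative to negatives, but they differ by a constant factor. Because instance weights also scale the quantities that settings such as min_child_weight are compared against, the two options are not guaranteed to produce identical models.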

After training both models on the same data, we generate predictions and evaluate their performance using the confusion matrix and classification report.

The choice between sample_weight and scale_pos_weight depends on the specific characteristics of your dataset and the desired behavior of the model. sample_weight allows for fine-grained control over the importance of individual instances, while scale_pos_weight is a single global setting that scales the weight of the positive class in binary classification.
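As a minimal sketch of that fine-grained control, a class-level weight can be multiplied by a per-row factor. The row_importance values here are hypothetical (for example, a measure of how much each label is trusted) and do not come from the article's dataset.

```python
import numpy as np

# Toy labels (illustrative only)
y = np.array([0, 0, 0, 1, 1, 0])

# Hypothetical per-row importance, e.g. confidence in each label
row_importance = np.array([1.0, 1.0, 0.5, 1.0, 2.0, 1.0])

# Class-level weights, in the same style as the article's class_weights dict
class_weights = {0: 1.0, 1: 10.0}
class_part = np.array([class_weights[label] for label in y])

# Final weight per row = class weight * row-specific importance
sample_weights = class_part * row_importance

print(sample_weights)
```

The resulting array would be passed to fit() through the sample_weight argument, exactly as in the example above. This kind of row-level adjustment cannot be expressed with scale_pos_weight alone.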

In general, if you have a highly imbalanced dataset and want to give more importance to the minority class, scale_pos_weight can be a simple and effective solution. However, if you have additional information about the importance of individual instances or want to assign different weights to specific samples, sample_weight provides more flexibility.

Evaluating the performance of both approaches using metrics like precision, recall, and F1-score can help you determine which technique works better for your specific problem.
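A side-by-side comparison of minority-class metrics might look like the following sketch. The labels and predictions are small hand-written arrays standing in for the outputs of two models; they are assumptions for illustration, not results from the models trained above.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hand-written stand-ins for true labels and two models' predictions
y_true = [0, 0, 0, 0, 1, 1]
predictions = {
    "model_a": [0, 0, 0, 1, 1, 0],
    "model_b": [0, 0, 1, 1, 1, 1],
}

# Collect precision, recall, and F1 for the positive (minority) class
results = {}
for name, y_pred in predictions.items():
    results[name] = (
        precision_score(y_true, y_pred, pos_label=1),
        recall_score(y_true, y_pred, pos_label=1),
        f1_score(y_true, y_pred, pos_label=1),
    )

for name, (precision, recall, f1) in results.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case both models have the same precision, but model_b recovers every positive, so it has the higher recall and F1-score. Which trade-off is preferable depends on the relative cost of false positives and false negatives in your problem.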
