XGBoost Multi-Class Imbalanced Classification

While XGBoost’s scale_pos_weight parameter is effective for handling class imbalance in binary classification problems, it does not apply to multi-class scenarios.

For imbalanced multi-class classification, the usual approach is to pass a sample_weight array to the fit method, giving each training instance a weight inversely proportional to the frequency of its class.

This example demonstrates how to compute per-instance weights and pass them as sample_weight when training an XGBoost model on an imbalanced multi-class dataset.

We’ll generate a synthetic imbalanced multi-class classification dataset using scikit-learn, train an XGBClassifier with sample_weight, and evaluate the model’s performance using the confusion matrix and classification report.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np

# Generate an imbalanced synthetic multi-class dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_clusters_per_class=1, weights=[0.7, 0.2, 0.1], random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compute sample_weight inversely proportional to class frequency
# (np.bincount assumes integer labels 0..n_classes-1, all present in y_train)
class_weights = dict(enumerate(len(y_train) / (len(np.unique(y_train)) * np.bincount(y_train))))
sample_weight = np.array([class_weights[label] for label in y_train])

# Initialize XGBClassifier (sample_weight is passed to fit, not to the constructor)
model = XGBClassifier(random_state=42)

# Train the model with sample_weight
model.fit(X_train, y_train, sample_weight=sample_weight)

# Generate predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

The sample_weight array is derived from the class frequencies in the training data. We first calculate one weight per class using the formula len(y_train) / (len(np.unique(y_train)) * np.bincount(y_train)), that is, n_samples / (n_classes * class_count). The rarer the class, the larger its weight, and every class ends up with the same total weight.
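To make the formula concrete, here is a minimal worked sketch on a hypothetical ten-sample label array. The names y_toy and toy_class_weights are illustrative and do not appear in the example above.

```python
import numpy as np

# Hypothetical toy labels: 7 samples of class 0, 2 of class 1, 1 of class 2
y_toy = np.array([0] * 7 + [1] * 2 + [2] * 1)

counts = np.bincount(y_toy)          # samples per class: [7, 2, 1]
n_classes = len(np.unique(y_toy))    # 3

# Same formula as in the example: n_samples / (n_classes * class_count)
toy_class_weights = len(y_toy) / (n_classes * counts)

print(np.round(toy_class_weights, 3))  # [0.476 1.667 3.333]

# Weight * count is identical for every class (n_samples / n_classes),
# so each class contributes the same total weight during training.
print(toy_class_weights * counts)
```

The majority class is down-weighted below 1 and the rarest class is up-weighted the most, while the weights still sum to the number of samples.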

Then, we create a sample_weight array by mapping each training label to its corresponding class weight.
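As an alternative to the manual mapping, scikit-learn provides compute_sample_weight, whose "balanced" mode uses the same n_samples / (n_classes * class_count) heuristic. The sketch below checks the two against each other on a hypothetical label array (y_demo is an illustrative stand-in for y_train).

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical labels standing in for y_train
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])

# Manual approach: one weight per class, then index by label
manual_class_weights = len(y_demo) / (len(np.unique(y_demo)) * np.bincount(y_demo))
manual_sample_weight = manual_class_weights[y_demo]

# scikit-learn helper with the "balanced" heuristic
sk_sample_weight = compute_sample_weight(class_weight="balanced", y=y_demo)

# Both approaches should produce the same per-sample weights
print(np.allclose(manual_sample_weight, sk_sample_weight))
```

The helper also works when labels are not consecutive integers starting at 0, a case the np.bincount version does not handle.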

Passing the sample_weight array to the fit method of the XGBClassifier scales each instance's contribution to the training loss, so errors on underrepresented classes count for more. This typically improves recall on minority classes, sometimes at the cost of precision on the majority class.

The confusion matrix and classification report show per-class performance on the imbalanced test data, which overall accuracy alone would hide. Use the per-class precision and recall to judge whether sample_weight is actually helping the minority classes.
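When reading the confusion matrix, per-class recall is a useful summary: the diagonal entry divided by the row total. The sketch below uses hypothetical labels and predictions (y_true and y_hat are made up for illustration and are not output from the model above) and compares the result with scikit-learn's balanced_accuracy_score, which is the mean of the per-class recalls.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

# Hypothetical true labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 2])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)

# Recall per class: correct predictions / number of true samples in that class
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)

# Balanced accuracy averages recall over classes, so rare classes count equally
print(balanced_accuracy_score(y_true, y_hat))
```

Comparing these numbers for a model trained with and without sample_weight is one simple way to see whether the weighting changed minority-class behaviour.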

See Also