XGBoost for Imbalanced Classification

Imbalanced classification tasks, where the number of instances in each class is significantly different, are common in real-world machine learning applications.

XGBoost provides effective techniques to handle class imbalance and improve model performance.

By adjusting the scale_pos_weight and max_delta_step parameters, you can effectively train an XGBoost model on imbalanced data.

scale_pos_weight controls the balance of positive and negative weights, while max_delta_step caps the maximum step allowed for each leaf output, making updates more conservative so that heavily weighted minority-class examples do not produce extreme adjustments.

# XGBoosting.com
# Training an XGBoost Model for Imbalanced Classification
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np

# Generate an imbalanced synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)

# Split data into train and test sets, stratifying to preserve the class ratio in both
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

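# Compute the positive class weight (negatives / positives) from the training labels only,
# so that no information about the test set leaks into the model's configuration
pos_class_weight = (len(y_train) - np.sum(y_train)) / np.sum(y_train)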

# Initialize XGBClassifier with scale_pos_weight and max_delta_step
model = XGBClassifier(
    n_estimators=100,
    objective='binary:logistic',
    scale_pos_weight=pos_class_weight,
    max_delta_step=1,
    random_state=42
)

# Fit the model
model.fit(X_train, y_train)

# Generate predictions
predictions = model.predict(X_test)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))
print("\nClassification Report:")
print(classification_report(y_test, predictions))

By setting scale_pos_weight to the ratio of the number of negative instances to the number of positive instances, the model gives more importance to the minority class during training.
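To make "more importance" concrete, here is a simplified sketch of how a weight on positive examples could enter the gradient statistics of the logistic loss. It is an illustration under stated assumptions, not XGBoost's actual code: the helper `leaf_value` is hypothetical, it ignores the learning rate and tree structure, and it treats all ten examples as if they fell into a single leaf.

```python
import numpy as np

def leaf_value(y, p, scale_pos_weight=1.0, reg_lambda=1.0):
    """Hypothetical helper: Newton-style leaf value for the logistic loss.

    y: binary labels, p: current predicted probabilities.
    Positive examples have their gradient and hessian multiplied
    by scale_pos_weight; negatives keep a weight of 1.
    """
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    w = np.where(y == 1, scale_pos_weight, 1.0)
    grad = w * (p - y)          # first derivative of the log loss
    hess = w * p * (1.0 - p)    # second derivative of the log loss
    return -grad.sum() / (hess.sum() + reg_lambda)

# Nine negatives and one positive, all currently predicted at 0.5
y = np.array([0] * 9 + [1])
p = np.full(10, 0.5)

unweighted = leaf_value(y, p)                      # pulled strongly toward class 0
weighted = leaf_value(y, p, scale_pos_weight=9.0)  # the single positive offsets the nine negatives

print("unweighted leaf value:", unweighted)
print("weighted leaf value:  ", weighted)
```

Without the weight, the nine negatives dominate and the leaf value is clearly negative. With a weight of 9 on the positive example, the two classes contribute equally and the step is roughly zero.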

The intuition for scale_pos_weight is that it tells you how many negative instances (labeled as “0”) there are for each positive instance (labeled as “1”) in your dataset.

It can be set as follows (as recommended in the API documentation):

scale_pos_weight = sum(negative instances) / sum(positive instances)
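The ratio above can be computed directly from the label array. The following is a minimal sketch, assuming binary labels encoded as 0 and 1; the helper name `compute_scale_pos_weight` and the example labels are made up for illustration.

```python
import numpy as np

def compute_scale_pos_weight(y):
    """Return (# negative instances) / (# positive instances) for 0/1 labels."""
    y = np.asarray(y)
    n_pos = int(np.sum(y == 1))
    n_neg = int(np.sum(y == 0))
    if n_pos == 0:
        raise ValueError("y contains no positive instances")
    return n_neg / n_pos

# Hypothetical training labels: 90 negatives and 10 positives
y_train_example = np.array([0] * 90 + [1] * 10)
ratio = compute_scale_pos_weight(y_train_example)
print(ratio)  # 9.0
```

In practice, it is safer to compute this value from the training labels only, so that the test set plays no part in configuring the model.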

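The max_delta_step parameter caps the size of each leaf's output (the default of 0 means no constraint), which makes each update more conservative and keeps the model from overcorrecting for the minority class. The XGBoost documentation notes that values between 1 and 10 may help when the classes are extremely imbalanced.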
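The effect of the cap can be sketched in a few lines. This is a simplified illustration rather than XGBoost's implementation: the helper `clipped_leaf_value` is hypothetical, the leaf holds a single positive example, and the learning rate is ignored.

```python
import math

def clipped_leaf_value(grad_sum, hess_sum, reg_lambda=1.0, max_delta_step=0.0):
    """Hypothetical helper: Newton-style leaf value, optionally capped in magnitude."""
    raw = -grad_sum / (hess_sum + reg_lambda)
    # A value of 0 means "no constraint"; a positive value caps |raw|
    if max_delta_step > 0 and abs(raw) > max_delta_step:
        return math.copysign(max_delta_step, raw)
    return raw

# One positive example that the model currently scores at p = 0.001,
# weighted with an assumed scale_pos_weight of 99
p, spw = 0.001, 99.0
grad = spw * (p - 1.0)        # large negative gradient
hess = spw * p * (1.0 - p)    # very small hessian

unconstrained = clipped_leaf_value(grad, hess)
constrained = clipped_leaf_value(grad, hess, max_delta_step=1.0)

print("without max_delta_step:", unconstrained)
print("with max_delta_step=1: ", constrained)
```

A confidently misclassified, heavily weighted positive has a large gradient but a tiny hessian, so the unconstrained step is very large. Capping the step at 1 keeps the direction of the update while limiting its size.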

Accuracy alone can be misleading on imbalanced data, because a model that always predicts the majority class still scores highly. The confusion matrix and classification report break performance down by class, so you can check the precision and recall of the minority class and decide whether further tuning is needed.
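As a rough illustration of how to read these outputs, the sketch below derives minority-class metrics from a 2x2 confusion matrix laid out the way scikit-learn's confusion_matrix returns it for labels 0 and 1 (rows are true labels, columns are predicted labels). The helper name and the counts are made up for the example; they are not results from the model above.

```python
def minority_class_metrics(cm):
    """Hypothetical helper: metrics for the positive class from [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Made-up counts: 180 true negatives/false positives in row 0, 20 actual positives in row 1
example_cm = [[170, 10],
              [5, 15]]
metrics = minority_class_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, accuracy is above 0.9 while precision on the positive class is only 0.6, which is the kind of gap that accuracy alone would hide.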
