XGBoost Configure "class_weight" Parameter for Imbalanced Classification

Class imbalance is a common issue in real-world classification problems, where one class has far more instances than the other.

XGBoost provides the scale_pos_weight parameter to handle imbalanced binary datasets by scaling the weight of the positive class.

Some other machine learning libraries use a parameter named class_weight for this purpose; XGBoost has no parameter by that name and uses scale_pos_weight instead.

This example demonstrates how to compute and set the scale_pos_weight parameter when training an XGBoost model on imbalanced data.

We’ll generate a synthetic imbalanced binary classification dataset using scikit-learn, train an XGBClassifier with scale_pos_weight, and evaluate the model’s performance using the confusion matrix and classification report.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Generate an imbalanced synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)

# Split data into train and test sets, stratifying so both keep roughly the same class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Compute scale_pos_weight as (negative count) / (positive count) in the training set
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
print(f'scale_pos_weight: {scale_pos_weight:.2f}')

# Initialize XGBClassifier with scale_pos_weight
model = XGBClassifier(scale_pos_weight=scale_pos_weight, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Generate predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

The scale_pos_weight parameter is computed as the ratio of the number of negative instances to the number of positive instances in the training data.
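As a small worked example of that ratio, the sketch below counts labels in a plain Python list. The helper name compute_scale_pos_weight and the 720/80 label counts are made up for illustration; they are not part of XGBoost or of the dataset above.

```python
from collections import Counter


def compute_scale_pos_weight(labels):
    """Return (number of negatives) / (number of positives) for 0/1 labels.

    Hypothetical helper for illustration; not an XGBoost function.
    """
    counts = Counter(labels)
    n_neg, n_pos = counts[0], counts[1]
    if n_pos == 0:
        raise ValueError("no positive instances; the ratio is undefined")
    return n_neg / n_pos


# Illustrative training labels: 720 negatives and 80 positives (assumed counts)
labels = [0] * 720 + [1] * 80
ratio = compute_scale_pos_weight(labels)
print(ratio)  # 720 / 80 = 9.0
```

With a 9:1 imbalance the ratio is 9.0, so each positive instance would count roughly as much as nine negatives.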

Setting this parameter makes XGBoost weight errors on positive instances more heavily during training, which compensates for their scarcity.
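To make the effect more concrete, here is a simplified sketch of how a positive-class weight can enter the gradient and Hessian of the logistic loss, assuming the binary logistic objective. The function weighted_logistic_grad_hess is an illustrative name, not XGBoost's API, and the real implementation may differ in detail.

```python
import math


def sigmoid(z):
    """Map a raw margin to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def weighted_logistic_grad_hess(margin, label, scale_pos_weight=1.0):
    """Gradient and Hessian of the logistic loss for one instance.

    Assumption: positive instances (label == 1) have both terms
    multiplied by scale_pos_weight; negatives keep a weight of 1.
    """
    p = sigmoid(margin)
    w = scale_pos_weight if label == 1 else 1.0
    grad = (p - label) * w
    hess = p * (1.0 - p) * w
    return grad, hess


# At margin 0 the predicted probability is 0.5 for both instances
g_pos, h_pos = weighted_logistic_grad_hess(0.0, 1, scale_pos_weight=9.0)
g_neg, h_neg = weighted_logistic_grad_hess(0.0, 0, scale_pos_weight=9.0)
print(g_pos, h_pos)  # positive instance: gradient and Hessian scaled by 9
print(g_neg, h_neg)  # negative instance: unscaled
```

Under this assumption, a misclassified positive pulls the model nine times harder than a misclassified negative at the same predicted probability.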

The confusion matrix and classification report break performance down by class on the imbalanced test data, which is more informative than accuracy alone when assessing whether scale_pos_weight helped.
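To show how the confusion matrix relates to the positive-class figures in the classification report, the sketch below derives precision, recall, and F1 by hand. The counts are hypothetical and are not the output of the model above; the layout follows scikit-learn's convention of rows for true labels and columns for predicted labels.

```python
# Hypothetical 2x2 confusion matrix: [[TN, FP], [FN, TP]]
cm = [[170, 10],
      [5, 15]]

tn, fp = cm[0]
fn, tp = cm[1]

# Precision: of the instances predicted positive, how many were positive
precision = tp / (tp + fp)
# Recall: of the truly positive instances, how many were found
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
# Accuracy, for comparison
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.3f}")
```

In this made-up case accuracy is 0.925 even though precision on the minority class is only 0.60, which is why the per-class figures matter on imbalanced data.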
