When working with imbalanced classification tasks, where the number of instances in each class is significantly different, XGBoost provides two main parameters to handle class imbalance: scale_pos_weight and sample_weight.
This example demonstrates how to use both parameters and compares their performance using evaluation metrics on a synthetic imbalanced dataset.
scale_pos_weight example:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report
# Generate an imbalanced synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)
# Split data into train and test sets, stratifying to preserve the class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# Compute scale_pos_weight as ratio of negative to positive instances in train set
scale_pos_weight = len(y_train[y_train == 0]) / len(y_train[y_train == 1])
# Initialize XGBClassifier with scale_pos_weight
model_spw = XGBClassifier(n_estimators=100, scale_pos_weight=scale_pos_weight, random_state=42)
# Train model and evaluate performance on test set
model_spw.fit(X_train, y_train)
pred_spw = model_spw.predict(X_test)
print("scale_pos_weight model:")
print("Confusion Matrix:")
print(confusion_matrix(y_test, pred_spw))
print("\nClassification Report:")
print(classification_report(y_test, pred_spw))
sample_weight example:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np
# Generate an imbalanced synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)
# Split data into train and test sets, stratifying to preserve the class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# Create sample_weight array mapping class weights to instances in train set
class_weights = {0: 1, 1: 10}
sample_weights = np.array([class_weights[class_id] for class_id in y_train])
# Initialize XGBClassifier with default parameters
model_sw = XGBClassifier(n_estimators=100, random_state=42)
# Train model using sample_weight and evaluate on test set
model_sw.fit(X_train, y_train, sample_weight=sample_weights)
pred_sw = model_sw.predict(X_test)
print("sample_weight model:")
print("Confusion Matrix:")
print(confusion_matrix(y_test, pred_sw))
print("\nClassification Report:")
print(classification_report(y_test, pred_sw))
The scale_pos_weight parameter is a global approach for binary classification: it applies a single weight to every positive instance, rebalancing the two classes as a whole. It is set to the ratio of the number of negative instances to the number of positive instances in the training set.
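Under the hood, scale_pos_weight effectively multiplies the weight of every positive instance, so it can be viewed as shorthand for a uniform per-instance weight on the positive class. A minimal sketch of this relationship, assuming X_train and y_train from the example above (the equiv_weights and model_equiv names are illustrative):
import numpy as np
from xgboost import XGBClassifier
# Ratio of negative to positive instances, as computed above
neg, pos = np.bincount(y_train)
ratio = neg / pos
# Per-instance weights: 1 for negatives, `ratio` for positives,
# mirroring the global weighting that scale_pos_weight applies
equiv_weights = np.where(y_train == 1, ratio, 1.0)
model_equiv = XGBClassifier(n_estimators=100, random_state=42)
model_equiv.fit(X_train, y_train, sample_weight=equiv_weights)
In practice the two formulations should produce very similar models; scale_pos_weight is simply the more convenient knob for the binary case.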
On the other hand, sample_weight allows weighting the importance of individual instances and works for both binary and multiclass problems. In this example, we create a sample_weight array by mapping class weights to the corresponding instances in the training set, assigning a weight of 1 to the majority class and 10 to the minority class.
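Because sample_weight operates per instance, the same pattern extends to multiclass problems, where scale_pos_weight does not apply. The following sketch uses a hypothetical three-class dataset (not the one from the examples above) and scikit-learn's compute_sample_weight to derive balanced weights:
from sklearn.datasets import make_classification
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier
# Hypothetical imbalanced three-class dataset
X_mc, y_mc = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                                 weights=[0.8, 0.15, 0.05], random_state=42)
# 'balanced' weights each instance inversely proportional to its class frequency
weights_mc = compute_sample_weight(class_weight="balanced", y=y_mc)
# XGBClassifier handles the multiclass objective automatically
model_mc = XGBClassifier(n_estimators=100, random_state=42)
model_mc.fit(X_mc, y_mc, sample_weight=weights_mc)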
The choice between scale_pos_weight and sample_weight depends on the specific characteristics of your dataset and the desired behavior of the model. Evaluating the performance of both approaches using metrics like precision, recall, and F1-score can help you determine which technique works better for your specific problem.
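A minimal sketch of such a comparison, assuming both snippets above have been run so that pred_spw and pred_sw are defined, prints the minority-class metrics side by side:
from sklearn.metrics import precision_score, recall_score, f1_score
# Minority-class (label 1) metrics for both weighting strategies
for name, pred in [("scale_pos_weight", pred_spw), ("sample_weight", pred_sw)]:
    print(f"{name}: precision={precision_score(y_test, pred):.3f}, "
          f"recall={recall_score(y_test, pred):.3f}, "
          f"f1={f1_score(y_test, pred):.3f}")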