When dealing with imbalanced multi-class classification problems, XGBoost’s sample_weight parameter can be used to assign per-instance weights based on class frequencies. Scikit-learn’s compute_sample_weight function provides a convenient way to calculate sample weights for imbalanced datasets. This example demonstrates how to use compute_sample_weight to set the sample_weight parameter in XGBoost when training on an imbalanced multi-class dataset.
We’ll generate a synthetic imbalanced multi-class dataset using scikit-learn, train an XGBClassifier with the computed sample weights, and evaluate the model’s performance using a confusion matrix and classification report.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.utils.class_weight import compute_sample_weight
# Generate an imbalanced synthetic multi-class dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_clusters_per_class=1, weights=[0.7, 0.2, 0.1], random_state=42)
# Split data into stratified train and test sets to preserve the class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# Compute sample_weight using compute_sample_weight
sample_weight = compute_sample_weight('balanced', y_train)
# Initialize XGBClassifier
model = XGBClassifier(random_state=42)
# Train the model with sample_weight
model.fit(X_train, y_train, sample_weight=sample_weight)
# Generate predictions
y_pred = model.predict(X_test)
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
The compute_sample_weight function accepts the 'balanced' mode, which calculates sample weights inversely proportional to class frequencies in the input data: each sample of class c receives the weight n_samples / (n_classes * count(c)). By passing the training labels y_train to compute_sample_weight, we obtain an array of per-sample weights that assigns higher weights to instances from underrepresented classes.
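As a quick sanity check (a minimal sketch using only NumPy and the variables defined above), the 'balanced' weights can be reproduced by hand, which also shows that every class ends up contributing the same total weight:
import numpy as np
# Reproduce the 'balanced' rule by hand: n_samples / (n_classes * class_count)
counts = np.bincount(y_train)
n_classes = len(counts)
manual_weight = len(y_train) / (n_classes * counts[y_train])
# The manual weights match scikit-learn's output
assert np.allclose(manual_weight, sample_weight)
# Each class now contributes the same total weight to the training objective
for c in range(n_classes):
    print(f"class {c}: total weight = {sample_weight[y_train == c].sum():.1f}")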
We then initialize an XGBClassifier and train it with the computed sample_weight passed to fit. This ensures that the model gives more importance to instances from minority classes during training, effectively addressing the class imbalance.
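The same weights can also be attached to a DMatrix if you prefer XGBoost’s native training API over the scikit-learn wrapper. Here is a minimal sketch; the objective and boosting-round settings are illustrative assumptions, not taken from the example above:
import xgboost as xgb
# Attach the per-sample weights directly to the training DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train, weight=sample_weight)
dtest = xgb.DMatrix(X_test, label=y_test)
# Illustrative parameters for a 3-class softmax objective
params = {'objective': 'multi:softmax', 'num_class': 3, 'seed': 42}
booster = xgb.train(params, dtrain, num_boost_round=100)
y_pred_native = booster.predict(dtest)  # predicted class labels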
Evaluating the model using the confusion matrix and classification report provides insights into its performance on the imbalanced multi-class test data. By comparing these metrics with those obtained from training without sample weights, you can assess the effectiveness of using compute_sample_weight to handle multi-class imbalance in XGBoost.
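To make that comparison concrete, here is a minimal sketch of an unweighted baseline trained on the same split; with the 70/20/10 class distribution above, recall on the minority classes will often be lower than in the weighted model:
# Baseline: identical model trained without sample weights, for comparison
baseline = XGBClassifier(random_state=42)
baseline.fit(X_train, y_train)
print("\nBaseline (no sample weights) Classification Report:")
print(classification_report(y_test, baseline.predict(X_test)))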