XGBoost for Imbalanced Classification with SMOTE

Imbalanced datasets, where one class significantly outnumbers the other, pose challenges for machine learning algorithms like XGBoost.

Directly training on such data can lead to biased models that favor the majority class.

XGBoost provides native support for imbalanced classification via the scale_pos_weight parameter. Nevertheless, XGBoost may benefit from resampling the training dataset to adjust the class imbalance.
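
As a minimal sketch of the native option, the XGBoost documentation suggests setting scale_pos_weight to roughly the number of negative examples divided by the number of positive examples. The helper name compute_scale_pos_weight below is hypothetical and assumes binary labels where 1 is the minority (positive) class.

```python
from collections import Counter


def compute_scale_pos_weight(y):
    """Return count(negative) / count(positive) for binary labels 0/1."""
    counts = Counter(y)
    if counts[1] == 0:
        raise ValueError("no positive examples in y")
    return counts[0] / counts[1]


# Illustrative labels (an assumption): 90 negatives and 10 positives
labels = [0] * 90 + [1] * 10
weight = compute_scale_pos_weight(labels)
print(weight)
```

The resulting value could then be passed as XGBClassifier(scale_pos_weight=weight) and the model fit on the original, unresampled training data.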

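SMOTE (Synthetic Minority Over-sampling Technique) is a data preparation method that balances the class distribution by creating synthetic examples of the minority class. Each synthetic example is interpolated between an existing minority example and one of its nearest minority-class neighbors.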
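
To make the idea concrete, here is a simplified pure-Python sketch of the interpolation step. It is not the imbalanced-learn implementation; the function name smote_sketch, its parameters, and the sample points are assumptions chosen for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority example as the base point
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbors by Euclidean distance
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Interpolate at a random position on the segment base -> neighbor
        lam = rng.random()
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Illustrative minority-class points (an assumption, not real data)
minority_points = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]]
new_points = smote_sketch(minority_points, n_new=6, k=2)
print(len(new_points))
```

Because every synthetic point lies on a line segment between two real minority examples, the new points stay inside the region already occupied by the minority class.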

Combining SMOTE with XGBoost can improve minority-class performance on imbalanced datasets, although the size of the gain depends on the data and should be confirmed on a held-out test set.

First, we must install the imbalanced-learn library (imported as imblearn) using our preferred Python package manager, such as pip:

pip install imbalanced-learn

We can then use SMOTE to resample the training dataset and fit an XGBoost model on the resampled dataset.

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from xgboost import XGBClassifier

# Generate an imbalanced synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)

# Split data into train and test sets, preserving the class ratio in both
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Report the training split so it is comparable with the resampled counts
print(f"Original training class distribution: {Counter(y_train)}")

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print(f"Resampled class distribution: {Counter(y_train_resampled)}")

# Train the XGBoost classifier
model = XGBClassifier(n_estimators=100, objective='binary:logistic', random_state=42)
model.fit(X_train_resampled, y_train_resampled)

# Generate predictions
predictions = model.predict(X_test)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))
print("\nClassification Report:")
print(classification_report(y_test, predictions))

By applying SMOTE to the training data before fitting the XGBoost model, we expose the classifier to a balanced class distribution during training. This helps prevent the model from being biased towards the majority class. The test set is left unchanged so that the evaluation reflects the original class distribution.

The code snippet demonstrates the complete workflow, from generating an imbalanced dataset to training an XGBoost classifier on SMOTE-transformed data and evaluating its performance using a confusion matrix and classification report.
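
When reading the results, the minority-class precision and recall are usually more informative than overall accuracy. The sketch below shows how those figures follow from a 2x2 confusion matrix in the scikit-learn layout (rows are true labels, columns are predicted labels). The helper name minority_metrics and the counts are hypothetical and are not results from the code above.

```python
def minority_metrics(cm):
    """Compute precision, recall, and F1 for class 1 from a 2x2 confusion matrix.

    cm follows the scikit-learn layout: rows are true labels, columns are
    predicted labels, so cm = [[tn, fp], [fn, tp]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical counts for illustration only, not results from the code above
example_cm = [[170, 10], [5, 15]]
precision, recall, f1 = minority_metrics(example_cm)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.60 recall=0.75 f1=0.67
```

These are the same quantities that classification_report lists in the row for class 1.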

Combining SMOTE with XGBoost is a practical technique for handling imbalanced classification tasks and can yield models that detect the minority class more reliably.

See Also