XGBoost Tune "scale_pos_weight" Parameter

When working with imbalanced datasets in binary classification tasks, where one class significantly outnumbers the other, it’s crucial to adjust the scale_pos_weight parameter in XGBoost to achieve optimal performance.

A common recommendation is to set scale_pos_weight to the ratio of negative instances to positive instances in the dataset:

scale_pos_weight = sum(negative instances) / sum(positive instances)
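As a minimal sketch of the formula above, the ratio can be computed directly from a label array. The helper name neg_pos_ratio and the toy labels are assumptions made for illustration, and the sketch assumes labels are encoded as 0 (negative) and 1 (positive).

```python
import numpy as np

def neg_pos_ratio(y):
    """Return count(negatives) / count(positives) for 0/1 labels."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive instances; the ratio is undefined")
    return n_neg / n_pos

# Toy labels for illustration: 95 negatives and 5 positives
y_toy = np.array([0] * 95 + [1] * 5)
ratio = neg_pos_ratio(y_toy)
print(ratio)  # 19.0
```

The resulting value is what would be passed as scale_pos_weight when constructing the classifier.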

This heuristic is a reasonable starting point, but it may not be the optimal value for your specific dataset and chosen evaluation metric.

The following example demonstrates how to tune scale_pos_weight by trying different ratios and evaluating their impact on model performance metrics.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score
import pandas as pd

# Generate an imbalanced synthetic dataset
X, y = make_classification(n_samples=10000, n_classes=2, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

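# Calculate the heuristic negative-to-positive ratio (about 18 here).
# Note that XGBoost's own default for scale_pos_weight is 1.
heuristic_ratio = float((y_train == 0).sum() / (y_train == 1).sum())

# Define the range of scale_pos_weight ratios to test
ratios = [1, 5, 10, heuristic_ratio, 20, 50, 100]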

# Train and evaluate XGBoost models with different scale_pos_weight ratios
results = []
for ratio in ratios:
    model = XGBClassifier(scale_pos_weight=ratio, n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='binary')
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    results.append({'scale_pos_weight': ratio, 'Precision': precision, 'Recall': recall, 'F1-score': f1, 'ROC AUC': auc})

# Display results
results_df = pd.DataFrame(results)
print(results_df)

The code above generates an imbalanced synthetic dataset using scikit-learn’s make_classification function, with 95% of samples belonging to the majority class and 5% to the minority class. It then trains XGBoost models with different scale_pos_weight ratios and evaluates their performance using precision, recall, F1-score, and ROC AUC.

The results DataFrame will display the performance metrics for each scale_pos_weight ratio, allowing you to compare and select the best value for your specific use case. Generally, you should choose the ratio that optimizes your target metric(s) while considering the trade-offs between precision and recall.
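One way to make that selection explicit is sketched below: pick the row that maximizes a target metric, optionally subject to a minimum recall. The helper name pick_best and the min_recall parameter are hypothetical, and the numbers in the toy table are placeholders for illustration only, not output from a real run.

```python
import pandas as pd

def pick_best(results_df, metric, min_recall=0.0):
    """Return the row with the highest `metric` among rows meeting a recall floor."""
    eligible = results_df[results_df["Recall"] >= min_recall]
    if eligible.empty:
        raise ValueError("no ratio satisfies the recall floor")
    return eligible.loc[eligible[metric].idxmax()]

# Placeholder values for illustration only; not results from a real model.
toy_results = pd.DataFrame({
    "scale_pos_weight": [1, 10, 50],
    "Precision": [0.90, 0.80, 0.60],
    "Recall": [0.50, 0.70, 0.85],
    "F1-score": [0.64, 0.75, 0.70],
})

# Best F1 overall, then best F1 among ratios with recall of at least 0.8
best_f1 = pick_best(toy_results, "F1-score")
best_f1_high_recall = pick_best(toy_results, "F1-score", min_recall=0.8)
print(best_f1["scale_pos_weight"], best_f1_high_recall["scale_pos_weight"])
```

With the real results_df from the example, the same call would be pick_best(results_df, 'F1-score'). Adding a recall floor is one simple way to encode a precision/recall trade-off. Other choices, such as a weighted F-beta score, may suit some problems better.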

The optimal scale_pos_weight ratio varies with the dataset and the problem at hand. The example above compares ratios on a single held-out split, so before finalizing the model, confirm the choice with cross-validation or a separate validation set.

Tuning the scale_pos_weight parameter is a simple way to handle imbalanced datasets in XGBoost and can improve the performance of your binary classification models on the metrics you care about.
