Predict Class Probabilities with XGBoost

When working with binary or multi-class classification problems, you might want to obtain the predicted probabilities for each class instead of just the predicted class labels.

The predict_proba() method of XGBClassifier does exactly that, giving you more flexibility in how you interpret and use your model’s predictions.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Load the breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create an XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Fit the model on the training data
model.fit(X_train, y_train)

# Predict class probabilities on the test data
probabilities = model.predict_proba(X_test)

print("Predicted probabilities:\n", probabilities[:5])  # Print the first 5 samples

The predict_proba() method returns a 2D array where each row corresponds to a sample, and each column represents the probability of that sample belonging to a particular class.

For binary classification there are two columns: the first holds the probability of the negative class (usually labeled 0) and the second holds the probability of the positive class (usually labeled 1), so each row sums to 1.

Having access to class probabilities provides several benefits:

Setting custom decision thresholds: Instead of using the default 0.5 threshold, you can adjust it based on your specific problem. For example, if false negatives are more costly than false positives, you might lower the threshold to increase recall.
Computing probability-based metrics: Some evaluation metrics, such as log loss, require class probabilities rather than class labels.
Ranking or prioritizing predictions: Probabilities allow you to rank predictions by their confidence, which can be useful in applications like recommender systems or risk assessment.

To convert the predicted probabilities back to class labels in the binary case, apply a decision threshold to the positive-class column:

threshold = 0.5
predicted_labels = (probabilities[:, 1] >= threshold).astype(int)

When using predict_proba(), keep in mind that the returned probabilities are estimates and may not always perfectly reflect the true underlying probabilities. The quality of these estimates depends on factors like the model’s performance, the calibration of its output, and the inherent uncertainty in the data.
