When working with classification problems, you often need to predict the actual class labels for your samples, rather than the class probabilities.
While the predict_proba() method returns the probability of each sample belonging to each class, the predict() method directly outputs the predicted class labels.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# Load the iris dataset
X, y = load_iris(return_X_y=True)
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
# Fit the model on the training data
model.fit(X_train, y_train)
# Predict class labels on the test data
predicted_labels = model.predict(X_test)
print("Predicted labels:\n", predicted_labels[:5]) # Print the first 5 predicted labels
Under the hood, predict() applies a default threshold (usually 0.5 for binary classification) to the predicted probabilities to determine the class label. For multi-class problems, it returns the class with the highest predicted probability.
Using predict() instead of predict_proba() has a couple of advantages:
- Computational efficiency: If you only need the final class labels and don't plan on adjusting the decision threshold, predict() is computationally cheaper and more memory efficient.
- Simplicity: When you have a pre-determined decision threshold and don't require the flexibility of modifying it later, predict() provides a straightforward way to get your final predictions.
However, if you anticipate needing to adjust the decision threshold, compute probability-based metrics, or rank predictions by their confidence, stick with predict_proba() and derive the class labels from the probabilities as needed.