
Predict Class Labels with XGBoost

When working with classification problems, you often need to predict the actual class labels for your samples, rather than the class probabilities.

While the predict_proba() method returns the probability of each sample belonging to each class, the predict() method directly outputs the predicted class labels.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Load the iris dataset
X, y = load_iris(return_X_y=True)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create an XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Fit the model on the training data
model.fit(X_train, y_train)

# Predict class labels on the test data
predicted_labels = model.predict(X_test)

print("Predicted labels:\n", predicted_labels[:5])  # Print the first 5 samples

Under the hood, predict() applies a default threshold (usually 0.5 for binary classification) to the predicted probabilities to determine the class label. For multi-class problems, it returns the class with the highest predicted probability.
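
To make this concrete, here is a minimal sketch (reusing the fitted model and X_test from the example above) that derives the labels from predict_proba() with NumPy's argmax and checks that they match the output of predict():

import numpy as np

# Probabilities for each class, shape (n_samples, n_classes)
probabilities = model.predict_proba(X_test)

# For multi-class problems, the label is the class with the highest probability
labels_from_proba = np.argmax(probabilities, axis=1)

# Confirm this matches the output of predict()
print(np.array_equal(labels_from_proba, model.predict(X_test)))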

Using predict() instead of predict_proba() has two main advantages:

  1. Computational efficiency: If you only need the final class labels and don’t plan on adjusting the decision threshold, using predict() is computationally cheaper and more memory efficient.

  2. Simplicity: When you have a pre-determined decision threshold and don’t require the flexibility of modifying it later, predict() provides a straightforward way to get your final predictions.

However, if you anticipate needing to adjust the decision threshold, compute probability-based metrics, or rank predictions by their confidence, stick with predict_proba() and derive the class labels from the probabilities as needed.
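
For example, here is a minimal sketch of deriving labels from predict_proba() at a custom decision threshold; the load_breast_cancer dataset and the 0.7 threshold are arbitrary choices for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Load a binary classification dataset (chosen for illustration)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a classifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

# Probability of the positive class (second column of predict_proba)
pos_proba = model.predict_proba(X_test)[:, 1]

# Apply a custom threshold of 0.7 instead of the default 0.5
custom_labels = (pos_proba >= 0.7).astype(int)

print("Labels at threshold 0.7:\n", custom_labels[:5])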



See Also