When working with classification problems, you often need to predict the actual class labels for your samples, rather than the class probabilities.
While the predict_proba() method returns the probability of each sample belonging to each class, the predict() method directly outputs the predicted class labels.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# Load the iris dataset
X, y = load_iris(return_X_y=True)
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
# Fit the model on the training data
model.fit(X_train, y_train)
# Predict class labels on the test data
predicted_labels = model.predict(X_test)
print("Predicted labels:\n", predicted_labels[:5]) # Print the first 5 predicted labels
Under the hood, predict() applies a default threshold (usually 0.5 for binary classification) to the predicted probabilities to determine the class label. For multi-class problems, it returns the class with the highest predicted probability.
Using predict() instead of predict_proba() has a couple of advantages:
- Computational efficiency: If you only need the final class labels and don't plan on adjusting the decision threshold, predict() is computationally cheaper and more memory efficient.
- Simplicity: When you have a pre-determined decision threshold and don't require the flexibility of modifying it later, predict() provides a straightforward way to get your final predictions.
However, if you anticipate needing to adjust the decision threshold, compute probability-based metrics, or rank predictions by their confidence, stick with predict_proba() and derive the class labels from the probabilities as needed.