When working with classification problems, you often need to predict the actual class labels for your samples, rather than the class probabilities.
While the predict_proba() method returns the probability of each sample belonging to each class, the predict() method directly outputs the predicted class labels.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# Load the iris dataset
X, y = load_iris(return_X_y=True)
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
# Fit the model on the training data
model.fit(X_train, y_train)
# Predict class labels on the test data
predicted_labels = model.predict(X_test)
print("Predicted labels:\n", predicted_labels[:5]) # Print the first 5 samples
Under the hood, predict() applies a default threshold (usually 0.5 for binary classification) to the predicted probabilities to determine the class label. For multi-class problems, it returns the class with the highest predicted probability.
Using predict() instead of predict_proba() has a couple of advantages:

- Computational efficiency: If you only need the final class labels and don't plan on adjusting the decision threshold, using predict() is computationally cheaper and more memory efficient.
- Simplicity: When you have a pre-determined decision threshold and don't require the flexibility of modifying it later, predict() provides a straightforward way to get your final predictions.
However, if you anticipate needing to adjust the decision threshold, compute probability-based metrics, or rank predictions by their confidence, stick with predict_proba() and derive the class labels from the probabilities as needed.