When it comes to performing inference with a trained XGBoost model, you have two main options: booster.predict() and XGBClassifier.predict().

While both methods allow you to make predictions, they differ in their API design and input data format. This example demonstrates the key differences between these approaches and provides code examples for each.

Let's start by training an XGBoost model using xgboost.train() and making predictions with booster.predict():
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a synthetic binary classification dataset and split it
X, y = make_classification(n_samples=1000, n_classes=2, n_features=10, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the training parameters ('seed' is the native-API name for random_state)
params = {
    'objective': 'binary:logistic',
    'max_depth': 3,
    'learning_rate': 0.1,
    'seed': 42
}

# Wrap the data in DMatrix objects, as required by the native API
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Train with the native API and predict probabilities on the test set
model = xgb.train(params, dtrain, num_boost_round=100)
y_pred_proba = model.predict(dtest)

# Convert probabilities to class labels with a 0.5 threshold
y_pred = (y_pred_proba > 0.5).astype(int)
In this approach, we define the model parameters in a dictionary and create DMatrix objects for the train and test data. We then train the model using xgb.train() and make predictions on the test data using booster.predict(), which returns the raw probabilities. To get the class labels, we apply a threshold to the probabilities.
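One point worth emphasizing is that booster.predict() only accepts DMatrix input, so any new data must be wrapped first. The lines below are an illustrative extension of the script above; the accuracy check via scikit-learn's accuracy_score is an added example, not part of the original:

from sklearn.metrics import accuracy_score

# Illustrative addition: score the thresholded predictions
print("Accuracy:", accuracy_score(y_test, y_pred))

# New samples must also be wrapped in a DMatrix before calling booster.predict()
X_new = X_test[:5]                       # stand-in for unseen data
dnew = xgb.DMatrix(X_new)
print(model.predict(dnew))               # raw probabilities for the positive class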
Now, let's train the same model using XGBClassifier and make predictions with XGBClassifier.predict():
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate the same synthetic binary classification dataset and split it
X, y = make_classification(n_samples=1000, n_classes=2, n_features=10, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the model parameters for the scikit-learn wrapper
params = {
    'max_depth': 3,
    'learning_rate': 0.1,
    'n_estimators': 100,
    'random_state': 42
}

# Instantiate, fit, and predict class labels directly from numpy arrays
model = XGBClassifier(**params)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
With XGBClassifier, we define the model parameters directly in the constructor. We then instantiate the classifier with these parameters and train the model using the fit() method, which takes the training data as numpy arrays or pandas DataFrames. To make predictions, we simply call model.predict() on the test data, which returns the class labels directly.
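If you need probabilities from the scikit-learn wrapper rather than labels, predict_proba() is available. This short addition (not in the original script) shows the difference:

# Illustrative addition: probabilities from the scikit-learn wrapper
y_pred_proba = model.predict_proba(X_test)    # shape (n_samples, 2): one column per class
print(y_pred_proba[:5, 1])                    # probability of the positive class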
The key differences between booster.predict() and XGBClassifier.predict() are:

- booster.predict() uses DMatrix for input data, while XGBClassifier.predict() uses numpy arrays or pandas DataFrames.
- booster.predict() returns raw probabilities, while XGBClassifier.predict() returns class labels by default (you can use model.predict_proba() to get probabilities); the sketch after this list shows both behaviors side by side.
- XGBClassifier provides a simpler, scikit-learn compatible API, making it easier to integrate with existing pipelines.
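Despite these differences, both interfaces sit on top of the same underlying model. As a rough sketch (get_booster() is not used in the original example), you can extract the native Booster from a fitted XGBClassifier and check that its thresholded probabilities reproduce the labels from predict():

import xgboost as xgb

# Illustrative sketch: a fitted XGBClassifier wraps a native Booster
booster = model.get_booster()

# The native predict() needs a DMatrix and returns raw probabilities...
raw_proba = booster.predict(xgb.DMatrix(X_test))

# ...which, thresholded at 0.5, match the labels from XGBClassifier.predict()
labels_from_booster = (raw_proba > 0.5).astype(int)
print((labels_from_booster == model.predict(X_test)).all())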
When deciding which approach to use, consider your specific needs:

- Use booster.predict() if you require more control over the inference process or are working with DMatrix objects.
- Use XGBClassifier.predict() if you want a quick and easy way to make predictions or need to integrate with scikit-learn pipelines (a short pipeline sketch follows this list).
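To illustrate the pipeline point, here is a minimal sketch of XGBClassifier used inside a scikit-learn Pipeline. The StandardScaler step is only a placeholder preprocessing stage added for this example:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative sketch: XGBClassifier drops into a scikit-learn Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),   # placeholder preprocessing step
    ('model', XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)
print(pipeline.predict(X_test)[:5])  # class labels, just like calling predict() on the classifier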
By understanding the differences between booster.predict() and XGBClassifier.predict(), you can choose the most suitable approach for your XGBoost model inference tasks.