
Fit Final XGBoost Model and Predict on Out-Of-Sample Data

When building a final XGBoost model for production, it’s important to leverage all available data to maximize performance.

This example demonstrates how to train an XGBoost classifier on the entire dataset and make predictions on out-of-sample data.

# XGBoosting.com
# XGBoost Fit Final Model and Predict on Out-Of-Sample Data
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Generate a synthetic dataset for binary classification
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=2, random_state=42)

# Create an XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Fit the model on all data
model.fit(X, y)

# Define a new input record with the same 10 features as the training data
new_X = [[2.43978951, -1.86244336, -1.29449335, -0.5566518, 2.16011618, -0.68403094, 1.6582967, 0.51414583, -0.74371609, 4.55530742]]

# Make predictions on new data
yhat = model.predict(new_X)

print("Predicted class label:\n", yhat[0])

Here’s what the code does:

  1. We generate a synthetic dataset for a binary classification problem using make_classification from scikit-learn.
  2. An XGBClassifier is instantiated with specified hyperparameters.
  3. The model is fitted on the entire dataset.
  4. A new input record is defined and a prediction is made for it using model.predict(new_X).
  5. The predicted class label is printed (a probability variant is sketched below).
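
If class membership probabilities are needed rather than hard labels, the fitted classifier's predict_proba method can be called on the same input; a minimal sketch:

# Probability of each class for the new record
proba = model.predict_proba(new_X)
print("Predicted class probabilities:\n", proba[0])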

By training the XGBoost model on the entire available dataset, we ensure that it has access to all relevant information for making accurate predictions on unseen data. This approach is particularly useful when deploying the final model in a production environment, where it will encounter new, out-of-sample data.
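
In production, the fitted model is usually serialized once and reloaded at inference time rather than retrained. Below is a minimal sketch using XGBoost's native save_model and load_model; the file name final_model.json is an arbitrary choice:

# Save the fitted model in XGBoost's native JSON format
model.save_model("final_model.json")

# Later, in the serving process, restore the model and predict
loaded_model = XGBClassifier()
loaded_model.load_model("final_model.json")
print("Predicted class label:\n", loaded_model.predict(new_X)[0])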

Remember to apply any necessary data preprocessing steps to the input data before making predictions, ensuring consistency between the training and inference phases.
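
One way to guarantee that consistency is to bundle the preprocessing and the model into a single scikit-learn Pipeline, so the exact same transforms run at both fit and predict time. A minimal sketch, assuming a StandardScaler is the preprocessing your data requires:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Fit preprocessing and model together so inference applies identical scaling
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)),
])
pipeline.fit(X, y)
print("Predicted class label:\n", pipeline.predict(new_X)[0])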


