Missing Input Values With XGBoost

XGBoost has built-in functionality to handle missing values in training data.

The missing parameter, set when initializing the XGBoost model, specifies which value represents missing data (np.nan by default). During training, XGBoost learns a default direction at each tree split for rows where the split feature is missing, and it routes missing values along those learned directions at inference.
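To make the idea of a learned default direction concrete, here is a minimal pure-Python sketch. It is an illustration under simplifying assumptions, not XGBoost's actual implementation: the function names split_gain and best_default_direction are hypothetical, every row is given a Hessian of 1 (as with squared-error loss), and only a single fixed threshold is evaluated.

```python
import math


def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Gain of a split from gradient/Hessian sums (gamma and the 1/2 factor omitted)."""
    def score(g, h):
        return g * g / (h + lam)

    parent = score(g_left + g_right, h_left + h_right)
    return score(g_left, h_left) + score(g_right, h_right) - parent


def best_default_direction(values, grads, threshold, lam=1.0):
    """Try sending missing rows left, then right, and keep the better option."""
    g_left = g_right = g_miss = 0.0
    h_left = h_right = h_miss = 0.0
    for v, g in zip(values, grads):
        if math.isnan(v):          # missing value: hold it aside
            g_miss += g
            h_miss += 1.0          # assumption: Hessian of 1 per row
        elif v < threshold:
            g_left += g
            h_left += 1.0
        else:
            g_right += g
            h_right += 1.0

    gain_if_left = split_gain(g_left + g_miss, h_left + h_miss, g_right, h_right, lam)
    gain_if_right = split_gain(g_left, h_left, g_right + g_miss, h_right + h_miss, lam)
    if gain_if_left >= gain_if_right:
        return "left", gain_if_left
    return "right", gain_if_right


# Toy data: the two missing rows have gradients that resemble the right-hand group.
values = [0.1, 0.2, 0.8, 0.9, float("nan"), float("nan")]
grads = [-1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
direction, gain = best_default_direction(values, grads, threshold=0.5)
print(direction, round(gain, 3))
```

In this toy case the missing rows are sent to the right branch, because grouping them with rows that have similar gradients yields the larger gain.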

from sklearn.datasets import make_classification
from xgboost import XGBClassifier
import numpy as np

# Generate example dataset with missing values
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
rng = np.random.default_rng(42)  # seeded so the missing-value positions are reproducible
X[rng.choice(X.shape[0], 100), rng.choice(X.shape[1], 100)] = np.nan

# Initialize XGBoost model with missing value specification
model = XGBClassifier(missing=np.nan, random_state=42)

# Train the model on data with missing values
model.fit(X, y)

# Make predictions on new data with missing values
X_new = np.array([[1, 2, np.nan, 4, 5, np.nan, 7, 8, 9, 10]])
prediction = model.predict(X_new)

Here’s what’s happening:

We generate a synthetic dataset using make_classification from scikit-learn and introduce missing values represented as np.nan at random locations.
We initialize an XGBoost classifier, setting the missing parameter to np.nan. This tells XGBoost that np.nan represents missing data in our dataset.
We train the model on the dataset that contains missing values. At each tree split, XGBoost learns which branch rows with a missing value for that feature should follow.
Finally, we make a prediction on new data that also contains missing values. At inference, each missing value follows the branch that was learned for it during training.

See Also