
XGBoost NaN Input Values (missing)

Missing values are a common challenge in real-world datasets.

XGBoost, a powerful and widely-used gradient boosting library, provides built-in functionality to handle missing values during both training and inference.

In this example, we’ll demonstrate how to use the missing parameter in XGBoost to effectively deal with missing values in your input data.

from sklearn.datasets import make_classification
from xgboost import XGBClassifier
import numpy as np

# Generate a synthetic dataset with missing values
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X[np.random.choice(X.shape[0], 100), np.random.choice(X.shape[1], 100)] = np.nan

# Initialize an XGBoost classifier with the missing parameter set to np.nan
model = XGBClassifier(missing=np.nan, random_state=42)

# Train the XGBoost model on the dataset with missing values
model.fit(X, y)

# Make predictions on new data that also contains missing values
X_new = np.array([[1, 2, np.nan, 4, 5, np.nan, 7, 8, 9, 10]])
prediction = model.predict(X_new)
print(prediction)

Here’s what’s happening in each step:

  1. We import the necessary libraries: make_classification from scikit-learn to generate a synthetic dataset, XGBClassifier from XGBoost, and NumPy to represent missing values as np.nan.

  2. We generate a synthetic dataset using make_classification and introduce missing values (represented as np.nan) at random locations in the feature matrix.

  3. We initialize an XGBoost classifier with the missing parameter set to np.nan (which is also the default). This tells XGBoost that np.nan marks missing values in our dataset; a different sentinel value can be used instead, as sketched after this list.

  4. We train the XGBoost model on the dataset that contains missing values. During training, XGBoost's sparsity-aware split finding learns a default direction for missing values at each tree split, so no imputation is required.

  5. Finally, we demonstrate making predictions on new data that also contains missing values. XGBoost will handle the missing values in the input data during inference.

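The missing parameter is not limited to np.nan. If your data encodes missing entries with a sentinel value instead, you can point XGBoost at that value. Here is a minimal sketch, assuming the same X and y as above and using -999.0 as an arbitrary, illustrative sentinel:

# Re-encode missing values with a custom sentinel instead of np.nan
X_sentinel = X.copy()
X_sentinel[np.isnan(X_sentinel)] = -999.0

# Tell XGBoost that -999.0 marks missing entries
model_sentinel = XGBClassifier(missing=-999.0, random_state=42)
model_sentinel.fit(X_sentinel, y)

# New data must use the same sentinel for its missing entries
X_new_sentinel = np.array([[1, 2, -999.0, 4, 5, -999.0, 7, 8, 9, 10]])
print(model_sentinel.predict(X_new_sentinel))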
By setting the missing parameter on the XGBoost model, you ensure that missing values are handled consistently during both training and prediction, letting you work with real-world datasets that contain incomplete data.
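
The same approach works when features arrive as a pandas DataFrame, which is common for real-world tabular data. A brief sketch, assuming pandas is installed; the column names are placeholders:

import pandas as pd

# Wrap the feature matrix in a DataFrame; NaN cells carry over unchanged
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

# missing defaults to np.nan, so it does not need to be set explicitly
model_df = XGBClassifier(random_state=42)
model_df.fit(df, y)

# Predict on a DataFrame row that contains missing values
new_row = pd.DataFrame([[1, 2, np.nan, 4, 5, np.nan, 7, 8, 9, 10]],
                       columns=df.columns)
print(model_df.predict(new_row))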
