Impute Missing Input Values for XGBoost

XGBoost has native support for missing values: at each split it learns a default direction for rows whose feature value is missing.

Nevertheless, we can choose to impute missing values before training. This can be preferable when we do not want the model to treat missingness as a signal of its own and would rather substitute a representative value, such as the mean or median of each feature in the training data.

In this example, we demonstrate how to use SimpleImputer from scikit-learn to fill in missing values before training an XGBoost model.

from sklearn.impute import SimpleImputer
from xgboost import XGBRegressor
import numpy as np

# Synthetic feature matrix X with missing values
X = np.array([[2.5, 1.0, np.nan],
              [5.0, np.nan, 4.0],
              [3.0, 1.5, 3.5],
              [1.0, 0.5, 2.0],
              [4.5, 1.8, np.nan],
              [2.8, 1.2, 3.2]])

y = [10, 20, 15, 5, 18, 12]

# Initialize SimpleImputer
imputer = SimpleImputer(strategy='mean')

# Fit and transform the input features
X_imputed = imputer.fit_transform(X)

# Initialize and train XGBoost model
model = XGBRegressor(random_state=42)
model.fit(X_imputed, y)

# New data for prediction with missing values
X_new = np.array([[3.2, np.nan, 3.4],
                  [1.5, 0.8, np.nan]])

# Impute missing values in new data
X_new_imputed = imputer.transform(X_new)

# Make predictions
predictions = model.predict(X_new_imputed)

print("Predictions:", predictions)

Here’s a step-by-step breakdown:

1. Import the necessary classes: SimpleImputer from sklearn.impute for imputing missing values, and XGBRegressor from xgboost for building the XGBoost model.
2. Create a synthetic feature matrix X with missing values denoted by np.nan, and a corresponding target variable y.
3. Initialize a SimpleImputer object with the strategy parameter set to 'mean'. This tells the imputer to replace missing values with the mean of each feature.
4. Call fit_transform on the feature matrix X. This computes the mean of each feature from its non-missing values and replaces the missing entries with those means.
5. Initialize an XGBRegressor with any desired hyperparameters. Here, we set random_state for reproducibility.
6. Train the XGBoost model using the imputed feature matrix X_imputed and the target variable y.
7. When new data X_new arrives with missing values, call transform on the already fitted imputer. This fills the gaps with the means learned from the training data and does not recompute them from the new data.
8. Make predictions using the XGBoost model with the imputed new data X_new_imputed.
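To make the fit and transform steps above more concrete, the following sketch inspects the statistics_ attribute of the fitted imputer, which holds the per-feature fill values. It reuses the training matrix from the example. The single-row X_new is a shortened stand-in for the new data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[2.5, 1.0, np.nan],
              [5.0, np.nan, 4.0],
              [3.0, 1.5, 3.5],
              [1.0, 0.5, 2.0],
              [4.5, 1.8, np.nan],
              [2.8, 1.2, 3.2]])

imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Per-column means computed from the non-missing training values,
# roughly [3.133, 1.2, 3.175] for this matrix
print(imputer.statistics_)

# transform reuses the stored means; nothing is recomputed from X_new
X_new = np.array([[3.2, np.nan, 3.4]])
X_new_imputed = imputer.transform(X_new)
print(X_new_imputed)
```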

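In addition to the 'mean' strategy, SimpleImputer offers 'median', 'most_frequent', and 'constant' (which fills with a value supplied through the fill_value parameter). Choose the strategy that best suits your data and problem; the median, for example, is less sensitive to outliers than the mean.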
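As a rough comparison of these strategies, the sketch below fits each one on a small made-up matrix and prints the fill value learned for each column. The data and the fill_value of -1.0 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data only
X = np.array([[1.0, 2.0],
              [np.nan, 2.0],
              [3.0, np.nan],
              [3.0, 5.0]])

# Collect the per-column fill values learned by each strategy
learned = {}
for strategy in ('mean', 'median', 'most_frequent'):
    imputer = SimpleImputer(strategy=strategy)
    imputer.fit(X)
    learned[strategy] = imputer.statistics_
    print(strategy, imputer.statistics_)

# 'constant' ignores the data and fills every gap with fill_value
constant_imputer = SimpleImputer(strategy='constant', fill_value=-1.0)
X_constant = constant_imputer.fit_transform(X)
print(X_constant)
```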

It’s important to fit the imputer on the training data only and to reuse that same fitted imputer for any new or test data. This keeps the handling of missing values consistent and prevents statistics from the test data from leaking into training.
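A small sketch of why this matters, using made-up train and test matrices: the fitted imputer fills test gaps with the training means, whereas refitting on the test data (shown only as a contrast) would produce different fill values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data only
X_train = np.array([[1.0, 10.0],
                    [3.0, np.nan],
                    [np.nan, 30.0]])
X_test = np.array([[np.nan, 100.0],
                   [7.0, np.nan]])

# Fit on the training data only, then reuse the fitted imputer
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Missing test entries receive the training means (2.0 and 20.0)
print(X_test_imputed)

# For contrast only: refitting on the test data changes the fill values
X_test_refit = SimpleImputer(strategy='mean').fit_transform(X_test)
print(X_test_refit)
```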

See Also