XGBoost has native support for missing values.
Nevertheless, we can choose to impute missing values ourselves. This may be preferable when we do not want the model to treat missing values as a distinct case and would rather substitute the mean or median of each feature, computed from the training data.
In this example, we demonstrate how to use SimpleImputer from scikit-learn for efficient imputation of missing values.
from sklearn.impute import SimpleImputer
from xgboost import XGBRegressor
import numpy as np
# Synthetic feature matrix X with missing values
X = np.array([[2.5, 1.0, np.nan],
              [5.0, np.nan, 4.0],
              [3.0, 1.5, 3.5],
              [1.0, 0.5, 2.0],
              [4.5, 1.8, np.nan],
              [2.8, 1.2, 3.2]])
y = [10, 20, 15, 5, 18, 12]
# Initialize SimpleImputer
imputer = SimpleImputer(strategy='mean')
# Fit and transform the input features
X_imputed = imputer.fit_transform(X)
# Initialize and train XGBoost model
model = XGBRegressor(random_state=42)
model.fit(X_imputed, y)
# New data for prediction with missing values
X_new = np.array([[3.2, np.nan, 3.4],
                  [1.5, 0.8, np.nan]])
# Impute missing values in new data
X_new_imputed = imputer.transform(X_new)
# Make predictions
predictions = model.predict(X_new_imputed)
print("Predictions:", predictions)
Here’s a step-by-step breakdown:
1. Import the necessary classes: SimpleImputer from sklearn.impute for imputing missing values, and XGBRegressor from xgboost for building the XGBoost model.
2. Create a synthetic feature matrix X with missing values denoted by np.nan, and a corresponding target variable y.
3. Initialize a SimpleImputer object with the strategy parameter set to 'mean'. This tells the imputer to replace missing values with the mean value of each feature.
4. Fit the imputer on the feature matrix X and transform it to fill in the missing values using fit_transform. This step calculates the mean of each feature and replaces the missing values with these means.
5. Initialize an XGBRegressor with any desired hyperparameters. Here, we set a random_state for reproducibility.
6. Train the XGBoost model using the imputed feature matrix X_imputed and the target variable y.
7. When new data X_new arrives with missing values, use the fitted imputer to fill them in using transform. This step applies the same imputation values learned during training.
8. Make predictions using the XGBoost model with the imputed new data X_new_imputed.
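In addition to 'mean', SimpleImputer supports the 'median', 'most_frequent', and 'constant' strategies. The median is less sensitive to outliers than the mean, while 'most_frequent' and 'constant' can also be used with non-numeric features. Choose the strategy that best suits your data and problem.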
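To make the difference between strategies concrete, the following sketch fills the same missing entry with each one. The single-column data and the fill_value of -1.0 are made up for illustration; the column contains one outlier so that the mean and median diverge.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative column with one outlier (100.0) and one missing entry
X = np.array([[1.0], [2.0], [2.0], [100.0], [np.nan]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    # The last row holds the missing entry; record what it was replaced with
    filled[strategy] = imputer.fit_transform(X)[-1, 0]

# 'constant' uses fill_value (0 by default for numeric data)
constant_imputer = SimpleImputer(strategy="constant", fill_value=-1.0)
filled["constant"] = constant_imputer.fit_transform(X)[-1, 0]

for name, value in filled.items():
    print(f"{name}: {value}")
```

With this data the mean is pulled up to 26.25 by the outlier, while the median and the most frequent value are both 2.0.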
It’s important to fit the imputer on the training data only and then reuse that same fitted imputer, via transform, on any new or test data. This ensures that missing values are always filled with the values the model saw during training.
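One way to check this consistency is to inspect the fitted imputer's statistics_ attribute, which holds the fill value learned for each feature. The sketch below reuses the synthetic data from the example (renamed X_train here for clarity) and shows that missing entries in the new data are filled with the training means, not with means recomputed from the new data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[2.5, 1.0, np.nan],
                    [5.0, np.nan, 4.0],
                    [3.0, 1.5, 3.5],
                    [1.0, 0.5, 2.0],
                    [4.5, 1.8, np.nan],
                    [2.8, 1.2, 3.2]])
X_new = np.array([[3.2, np.nan, 3.4],
                  [1.5, 0.8, np.nan]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)  # learn per-feature means from the training data only

# statistics_ stores the value used to fill each feature
print("Training means:", imputer.statistics_)

# transform reuses the stored training means; it does not refit on X_new
X_new_imputed = imputer.transform(X_new)
print(X_new_imputed)
```

Calling fit_transform on X_new instead would recompute the means from the two new rows, so the filled values would no longer match what the model was trained on.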