XGBoost Interpolate Missing Values For Time Series Data

Time series data often contains missing values, for example because of sensor failures or data collection issues.

XGBoost can handle missing values natively, but it does so without using the temporal order of the observations. For time series, where neighboring values carry information about a gap, this may not be the best approach.

This example demonstrates how to interpolate missing values in a time series dataset using pandas before training an XGBoost model.

# XGBoosting.com
# Interpolate Missing Values in Time Series Data for XGBoost
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Generate a synthetic dataset; its rows are treated as consecutive time steps
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=3, random_state=42)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df['target'] = y

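# Randomly introduce missing values (NaNs), one per selected row
rng = np.random.default_rng(42)  # Seeded so the missing-value pattern is reproducible
num_rows, num_cols = 100, len(df.columns) - 1  # Avoid the target column
rows_to_nan = rng.choice(len(df), size=num_rows, replace=False)
cols_to_nan = rng.integers(0, num_cols, size=num_rows)  # One random feature column per selected row
# Set each (row, column) pair individually; df.iloc[rows, cols] would blank every listed column in every listed row
for row, col in zip(rows_to_nan, cols_to_nan):
    df.iat[row, col] = np.nan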

# Interpolate missing values using pandas; limit_direction='both' also fills NaNs in the first rows
df_interpolated = df.interpolate(method='linear', axis=0, limit_direction='both')

# Prepare the data for supervised learning
X = df_interpolated.iloc[:, :-1]
y = df_interpolated.iloc[:, -1]

# Split the data into train and test sets; shuffle=False keeps the test rows after the training rows in time
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Initialize an XGBClassifier model
model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Here’s what the code does step-by-step:

Generate a synthetic dataset using scikit-learn’s make_classification function. The data has no real temporal structure; its rows are simply treated as consecutive time steps.
Randomly introduce missing values (NaNs) into the dataset.
Use pandas’ interpolate function to fill in the missing values with the linear method.
Prepare the data for supervised learning by splitting the DataFrame into features (X) and target (y).
Split the data into train and test sets using scikit-learn’s train_test_split function.
Initialize an XGBClassifier model, fit it on the training data, and make predictions on the test set.
Evaluate the model’s performance using the accuracy_score metric.
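One detail of the interpolation step is easy to miss. With its default settings, linear interpolation in pandas leaves NaNs that precede the first valid observation untouched, and it fills trailing NaNs by repeating the last valid value. The following minimal sketch shows this on a toy series whose values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy series with a leading gap, an interior gap, and a trailing gap
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Default direction is forward: the leading NaN is left untouched
forward_only = s.interpolate(method="linear")

# limit_direction="both" also fills the leading NaN from the first valid value
both_directions = s.interpolate(method="linear", limit_direction="both")

print(forward_only.tolist())
print(both_directions.tolist())
```

Any NaNs that remain after interpolation are still accepted by XGBoost, so leaving the edges unfilled is a valid choice as well.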

Interpolation is a useful technique for handling missing values in time series data. The pandas interpolate method supports several strategies, including linear, time, and index. This example uses linear interpolation, which fills each gap along a straight line between the previous and next valid observations. Note that the linear method ignores the index and treats rows as equally spaced, whereas the time and index methods take the index values into account.
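The difference between methods only shows up when observations are irregularly spaced. The sketch below uses three made-up timestamps (an assumption for illustration, not part of the example dataset) to compare the linear and time methods on the same gap.

```python
import numpy as np
import pandas as pd

# Irregularly spaced timestamps: a one-day step, then a three-day jump
index = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
s = pd.Series([0.0, np.nan, 8.0], index=index)

# 'linear' treats the rows as equally spaced, so the gap gets the midpoint
linear = s.interpolate(method="linear")

# 'time' weights the fill by elapsed time: one of four days has passed
time_based = s.interpolate(method="time")

print(linear.iloc[1], time_based.iloc[1])
```

The time method requires a DatetimeIndex, so it does not apply directly to the integer-indexed DataFrame used in the main example.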

Interpolation is not always the best approach, especially if the values are missing not at random (MNAR) or if the data contains long gaps, where a straight line between distant observations can be a poor guess. In such cases, more advanced techniques like multiple imputation or domain-specific methods may be necessary.
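One possible way to guard against long gaps is to interpolate only short runs of NaNs and leave longer runs missing, so that XGBoost's own missing-value handling takes over there. The sketch below is one way to do that; the helper name interpolate_short_gaps and the max_gap threshold are hypothetical and chosen for illustration. A mask is used because the limit parameter of interpolate fills the first few NaNs of a long gap rather than skipping the gap entirely.

```python
import numpy as np
import pandas as pd

def interpolate_short_gaps(s: pd.Series, max_gap: int) -> pd.Series:
    """Linearly fill interior gaps of at most max_gap NaNs (hypothetical helper)."""
    is_nan = s.isna()
    # Give every run of consecutive NaN or non-NaN values its own id
    run_id = is_nan.astype(int).diff().ne(0).cumsum()
    # Length of the NaN run each position belongs to (0 for valid values)
    run_len = is_nan.groupby(run_id).transform("sum")
    # Fill interior gaps only, then restore NaN where the gap was too long
    filled = s.interpolate(method="linear", limit_area="inside")
    too_long = is_nan & (run_len > max_gap)
    return filled.where(~too_long)

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, np.nan, 8.0, 9.0])
result = interpolate_short_gaps(s, max_gap=2)
print(result.tolist())
```

A suitable threshold depends on how quickly the underlying signal changes, so it should be chosen with domain knowledge rather than taken from this sketch.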

This example serves as a starting point for handling missing values in time series data using interpolation with pandas before training an XGBoost model. You can extend this example by experimenting with different interpolation methods, comparing the results with other missing value handling techniques, or incorporating more complex feature engineering steps.
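As a starting point for such a comparison, the sketch below hides known values in a smooth synthetic signal and measures how closely two fill strategies recover them. The sine wave, the choice of hidden positions, and the use of mean absolute error are all assumptions made for illustration. The score reflects imputation error only, not the accuracy of a downstream XGBoost model.

```python
import numpy as np
import pandas as pd

# Smooth synthetic signal standing in for a real sensor reading (an assumption)
t = np.arange(200)
truth = pd.Series(np.sin(2 * np.pi * t / 50))

# Hide every tenth observation so that the true values are known
hidden = np.arange(5, 200, 10)
observed = truth.copy()
observed.iloc[hidden] = np.nan

candidates = {
    "linear": observed.interpolate(method="linear"),
    "ffill": observed.ffill(),
}

# Mean absolute error on the hidden points only
mae = {
    name: (filled.iloc[hidden] - truth.iloc[hidden]).abs().mean()
    for name, filled in candidates.items()
}
print(mae)
```

On a smooth signal like this one, linear interpolation recovers the hidden points more closely than forward fill. Noisy or step-like data may behave differently, so the comparison is worth repeating on your own series.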
