XGBoost Difference Transform Time Series Data

Many real-world time series exhibit nonstationary behavior, where statistical properties like the mean and variance change over time.

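However, XGBoost, like many forecasting models, tends to work best on stationary inputs. As a tree-based model it cannot extrapolate beyond the range of target values seen during training, so a trending series is a particular problem. One way to handle nonstationarity is to apply differencing, which computes the difference between consecutive observations. First-order differencing removes a linear trend; removing seasonality requires seasonal differencing, which subtracts the observation from one seasonal period earlier.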
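As a minimal sketch of how first-order and seasonal differencing differ, the snippet below builds a made-up series from a linear trend plus a repeating pattern. The period of 12 steps and the variable names are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

t = np.arange(48)
# Linear trend plus a repeating pattern with an assumed period of 12 steps
trend = 0.5 * t
seasonal = np.tile(np.arange(12, dtype=float), 4)
series = pd.Series(trend + seasonal)

# First-order differencing: y[t] - y[t-1] removes the linear trend,
# but the repeating pattern is still visible in the result
first_diff = series.diff().dropna()

# Seasonal differencing: y[t] - y[t-12] removes the repeating pattern;
# the linear trend collapses to the constant 0.5 * 12 = 6.0
seasonal_diff = series.diff(periods=12).dropna()

print(first_diff.unique())
print(seasonal_diff.unique())
```

On this toy series the seasonally differenced values are all identical, while the first-order differences still jump once per period.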

This example demonstrates how to use differencing to make a nonstationary univariate time series stationary, prepare the differenced data for supervised learning with lagged features, and train an XGBoost model to predict the next differenced value.

# XGBoosting.com
# Apply Differencing to Make a Time Series Stationary for XGBoost Forecasting
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Generate a nonstationary synthetic time series dataset (seeded for reproducibility)
np.random.seed(42)
series = np.sin(0.05 * np.arange(200)) + 0.1 * np.arange(200) + np.random.randn(200) * 0.1

# Apply differencing to make the series stationary
diff_series = pd.Series(series).diff().dropna()

# Prepare data for supervised learning
df = pd.DataFrame(diff_series, columns=['diff_value'])
df['diff_value_lag1'] = df['diff_value'].shift(1)
df = df.dropna()

X = df[['diff_value_lag1']].values
y = df['diff_value'].values

# Chronological split of data into train and test sets
split_index = int(len(X) * 0.8)  # 80% of data for training
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

# Initialize an XGBRegressor model
model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance (the MSE is measured on the differenced scale)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")

Here’s what the code does step-by-step:

Generate a nonstationary synthetic time series using a sine wave with a linear trend and added noise.
Apply differencing using diff() to make the series stationary.
Prepare the differenced data for supervised learning by creating a DataFrame with the differenced series and a lagged feature (lag of 1).
Split the data chronologically into train and test sets.
Initialize an XGBRegressor model, fit it on the training data, and make predictions on the test set.
Evaluate the model’s performance using Mean Squared Error (MSE), computed on the differenced values rather than on the original scale.
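The code above scores predictions on the differenced scale. To obtain forecasts in the original units, each predicted difference has to be added back to the previous observed level. The following is a minimal sketch of that inversion for one-step-ahead forecasts. The helper name invert_one_step and the toy numbers are illustrative assumptions, not part of the original example.

```python
import numpy as np

def invert_one_step(levels, pred_diffs, start):
    """Turn one-step-ahead predicted differences into level forecasts.

    levels:     original (undifferenced) series
    pred_diffs: predicted differences for positions start, start + 1, ...
    start:      index in `levels` of the first value being forecast
    """
    levels = np.asarray(levels, dtype=float)
    pred_diffs = np.asarray(pred_diffs, dtype=float)
    # Each forecast adds the predicted change to the previous observed level
    prev_levels = levels[start - 1 : start - 1 + len(pred_diffs)]
    return prev_levels + pred_diffs

levels = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
pred_diffs = np.array([3.5, 5.5])  # hypothetical predictions for positions 3 and 4
forecast = invert_one_step(levels, pred_diffs, start=3)
print(forecast)
```

Applied to the example above, two rows are lost before the split (one to diff(), one to shift(1)), so the first test row would correspond to position split_index + 2 of series, and the call would look like invert_one_step(series, y_pred, start=split_index + 2).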

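Differencing is a simple and effective technique for making nonstationary time series stationary, which many forecasting models require or benefit from. However, differencing discards the level of the series, so the model predicts changes rather than values, and each prediction must be added back to the last known observation to recover a forecast on the original scale. It can also obscure important long-term dependencies, and differencing more often than necessary tends to amplify noise. Always visualize your data and consider the problem context before applying differencing.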
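Alongside a plot, a rough numeric check can show whether differencing has removed the drift in the mean. The sketch below compares the mean of the first and second half of a series before and after differencing. It is a crude heuristic under assumed settings (the helper name half_mean_gap is made up here), not a formal stationarity test; a unit-root test such as the augmented Dickey-Fuller test would be the more rigorous option.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(200)
# Same shape of series as in the example: sine wave + linear trend + noise
series = pd.Series(np.sin(0.05 * t) + 0.1 * t + rng.normal(0, 0.1, 200))
diff_series = series.diff().dropna()

def half_mean_gap(s):
    """Absolute gap between the means of the first and second half of s."""
    mid = len(s) // 2
    return abs(s.iloc[:mid].mean() - s.iloc[mid:].mean())

gap_raw = half_mean_gap(series)        # large: the trend shifts the mean
gap_diff = half_mean_gap(diff_series)  # small: the mean is roughly constant

print(f"raw series gap:  {gap_raw:.3f}")
print(f"differenced gap: {gap_diff:.3f}")
```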

This example provides a starting point for using differencing with XGBoost for nonstationary time series forecasting. You can extend it to handle more complex scenarios by experimenting with different differencing orders, incorporating additional features, or using more advanced model architectures.
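One way to experiment with differencing order and additional lag features is to wrap the preparation step in a small helper. The function below is a hypothetical sketch (make_supervised, order, and n_lags are names introduced here, not part of the original example). It differences the series a chosen number of times and builds several lagged columns.

```python
import numpy as np
import pandas as pd

def make_supervised(series, order=1, n_lags=3):
    """Difference `series` `order` times, then build `n_lags` lagged features."""
    diffed = pd.Series(series, dtype=float)
    for _ in range(order):
        diffed = diffed.diff()
    df = pd.DataFrame({"target": diffed})
    for lag in range(1, n_lags + 1):
        df[f"lag{lag}"] = diffed.shift(lag)
    # Drop the rows lost to differencing and lagging
    df = df.dropna()
    X = df.drop(columns="target").values
    y = df["target"].values
    return X, y

# Toy quadratic series: its second difference is the constant 2.0
series = np.arange(20, dtype=float) ** 2
X, y = make_supervised(series, order=2, n_lags=3)
print(X.shape, y[:3])
```

The resulting X and y can be split chronologically and passed to XGBRegressor in the same way as in the example above. Each extra differencing order and each extra lag costs one row at the start of the series.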
