
XGBoost Add Lagged Input Variables for Time Series Forecasting

Adding lagged versions of input variables as additional features can often improve the performance of XGBoost models for time series forecasting tasks.
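
For a single column, a lag is just the same series shifted down so that the value from time t-1 lines up with row t. A quick standalone illustration with pandas (separate from the main example below):

import pandas as pd

s = pd.Series([10, 20, 30, 40])
# shift(1) moves every value down one row; row t now holds the value from t-1
print(s.shift(1).tolist())  # [nan, 10.0, 20.0, 30.0]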

This example demonstrates how to prepare a multivariate time series dataset by creating lagged features and then train and evaluate an XGBoost model on this enhanced dataset.

We’ll use a synthetic dataset generated by combining multiple univariate time series, each created using an AR(1) (autoregressive order 1) process. This will introduce autocorrelation in each variable, making it a more realistic time series dataset.
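
Concretely, each series follows the recurrence x[t] = ar_coef * x[t-1] + e[t], where e[t] is zero-mean Gaussian noise with the given standard deviation; this is exactly what the generate_ar1_series function in the code below implements.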

The code below shows the complete process from data generation to model evaluation.

import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Seed the random number generator so the synthetic data (and the reported MSE) are reproducible
np.random.seed(42)

# Generate synthetic multivariate time series data
def generate_ar1_series(n_samples, ar_coef, noise_std):
    series = [np.random.randn()]
    for _ in range(n_samples - 1):
        series.append(ar_coef * series[-1] + np.random.randn() * noise_std)
    return np.array(series)

n_series = 5
n_samples = 1000
ar_coefs = [0.8, 0.6, 0.7, 0.9, 0.5]
noise_stds = [0.1, 0.2, 0.15, 0.05, 0.25]

X = np.column_stack([generate_ar1_series(n_samples, ar_coef, noise_std)
                     for ar_coef, noise_std in zip(ar_coefs, noise_stds)])
y = np.sum(X, axis=1) + np.random.randn(n_samples) * 0.1

# Convert data to a DataFrame
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
df['target'] = y

# Create lagged features for the original feature columns only.
# Capture the column list before the loop: re-reading df.columns inside it
# would pick up the lag columns (and 'target') added on earlier passes.
lag = 2
feature_cols = [col for col in df.columns if col != 'target']
for i in range(1, lag + 1):
    for col in feature_cols:
        df[f'{col}_lag{i}'] = df[col].shift(i)

# Drop rows with missing values
df = df.dropna()

# Split the data into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Chronological split of data into train and test sets
split_index = int(len(X) * 0.8)  # 80% of data for training
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

# Initialize an XGBRegressor model
model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error with Lagged Features: {mse:.4f}")

In this example:

  1. We define a function generate_ar1_series that generates a univariate time series using an AR(1) process with a specified autoregressive coefficient and noise standard deviation.
  2. We create a synthetic multivariate time series dataset by generating multiple AR(1) series and stacking them as columns in a matrix X. The target variable y is the sum of these series plus some noise.
  3. We create a DataFrame with the original features and add lagged versions of these features. Here, we use lags of 1 and 2 time steps (a reusable helper for this step is sketched just after this list).
  4. We drop any rows with missing values that result from the lagging operation.
  5. We split the data chronologically into train and test sets.
  6. We initialize an XGBRegressor model, fit it on the training data, and make predictions on the test set.
  7. We evaluate the model’s performance using Mean Squared Error (MSE).
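
As noted in step 3, the lag columns come from pandas' shift. If you create lags in more than one place, a small helper keeps the intent explicit. This is a minimal sketch, and make_lags is a hypothetical name introduced here, not a pandas or XGBoost function:

import pandas as pd

def make_lags(frame: pd.DataFrame, cols, n_lags: int) -> pd.DataFrame:
    """Return a copy of frame with cols shifted by 1..n_lags steps appended."""
    out = frame.copy()
    for i in range(1, n_lags + 1):
        for col in cols:
            out[f'{col}_lag{i}'] = out[col].shift(i)
    return out

# Equivalent to the loop in the example above:
# df = make_lags(df, feature_cols, lag).dropna()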

By using AR(1) series, we introduce autocorrelation in each variable, making the synthetic dataset more representative of real-world time series data.

This example showcases how XGBoost can leverage lagged features to capture temporal dependencies and improve forecasting accuracy in multivariate time series with autocorrelation.
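
To see what the lags contribute on this dataset, you can train the same model on the unlagged feature columns alone and compare the two error values. A minimal sketch, reusing the variables from the example above (the '_lag' substring filter is a convenience introduced here, not part of the original code):

# Baseline: identical model, but trained only on the original (unlagged) features
base_cols = [c for c in X_train.columns if '_lag' not in c]

baseline = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
baseline.fit(X_train[base_cols], y_train)

baseline_mse = mean_squared_error(y_test, baseline.predict(X_test[base_cols]))
print(f"Mean Squared Error without Lagged Features: {baseline_mse:.4f}")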


