Adding lagged versions of input variables as additional features can often improve the performance of XGBoost models for time series forecasting tasks.

This example demonstrates how to prepare a multivariate time series dataset by creating lagged features and then train and evaluate an XGBoost model on this enhanced dataset.

We’ll use a synthetic dataset generated by combining multiple univariate time series, each created using an AR(1) (autoregressive order 1) process. This will introduce autocorrelation in each variable, making it a more realistic time series dataset.
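To make the autocorrelation concrete, here is a minimal sketch of a single AR(1) series, where each value is the previous value scaled by a coefficient plus Gaussian noise. The helper name `ar1_series`, the fixed seed, and the sample size of 5000 are illustrative assumptions, not part of the example that follows. For a stationary AR(1) process the lag-1 autocorrelation equals the autoregressive coefficient, so the printed estimate should land close to `ar_coef`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, only for repeatability

def ar1_series(n_samples, ar_coef, noise_std):
    # x[t] = ar_coef * x[t-1] + noise_std * e[t], with e[t] ~ N(0, 1)
    series = [rng.standard_normal()]
    for _ in range(n_samples - 1):
        series.append(ar_coef * series[-1] + noise_std * rng.standard_normal())
    return np.array(series)

series = ar1_series(n_samples=5000, ar_coef=0.8, noise_std=0.1)

# Correlation between each value and the value one step before it
lag1_corr = np.corrcoef(series[:-1], series[1:])[0, 1]
print(f"Lag-1 autocorrelation: {lag1_corr:.3f}")
```

A coefficient closer to 1 produces a smoother, more persistent series; a coefficient closer to 0 produces something nearer to white noise.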

The code below shows the complete process from data generation to model evaluation.

```
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic multivariate time series data
def generate_ar1_series(n_samples, ar_coef, noise_std):
    series = [np.random.randn()]
    for _ in range(n_samples - 1):
        series.append(ar_coef * series[-1] + np.random.randn() * noise_std)
    return np.array(series)

n_series = 5
n_samples = 1000
ar_coefs = [0.8, 0.6, 0.7, 0.9, 0.5]
noise_stds = [0.1, 0.2, 0.15, 0.05, 0.25]

X = np.column_stack([generate_ar1_series(n_samples, ar_coef, noise_std)
                     for ar_coef, noise_std in zip(ar_coefs, noise_stds)])
y = np.sum(X, axis=1) + np.random.randn(n_samples) * 0.1

# Convert data to a DataFrame
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
df['target'] = y

# Create lagged features (every column except the target)
lag = 2
for i in range(1, lag + 1):
    for col in df.columns[:-1]:
        df[f'{col}_lag{i}'] = df[col].shift(i)

# Drop rows with missing values
df = df.dropna()

# Split the data into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Chronological split of data into train and test sets (positional slicing)
split_index = int(len(X) * 0.8)  # 80% of data for training
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

# Initialize an XGBRegressor model
model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error with Lagged Features: {mse:.4f}")
```

In this example:

- We define a function `generate_ar1_series` that generates a univariate time series using an AR(1) process with a specified autoregressive coefficient and noise standard deviation.
- We create a synthetic multivariate time series dataset by generating multiple AR(1) series and stacking them as columns in a matrix `X`. The target variable `y` is the sum of these series plus some noise.
- We create a DataFrame with the original features and add lagged versions of these features. Here, we use lags of 1 and 2 time steps.
- We drop any rows with missing values that result from the lagging operation.
- We split the data chronologically into train and test sets.
- We initialize an `XGBRegressor` model, fit it on the training data, and make predictions on the test set.
- We evaluate the model’s performance using Mean Squared Error (MSE).
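The lagging and row-dropping steps are easier to see on a tiny table. The sketch below uses a hypothetical five-row DataFrame named `toy` (not part of the example above) to show what `shift` produces and why the first rows are lost.

```python
import pandas as pd

# Hypothetical toy data, only to make the effect of shift() visible
toy = pd.DataFrame({"feature_0": [10.0, 11.0, 12.0, 13.0, 14.0]})

lag = 2
for i in range(1, lag + 1):
    # shift(i) moves values down by i rows, so row t holds the value from t - i
    toy[f"feature_0_lag{i}"] = toy["feature_0"].shift(i)

print(toy)  # the first `lag` rows contain NaN in at least one lag column

# Rows without a full history cannot be used for training
clean = toy.dropna()
print(clean)
```

Each additional lag costs one more row at the start of the series, so with a maximum lag of 2 the first two rows are dropped.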

By using AR(1) series, we introduce autocorrelation in each variable, making the synthetic dataset more representative of real-world time series data.

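This example shows how to supply XGBoost with lagged features so that it can capture temporal dependencies in multivariate time series with autocorrelation. Whether the lags actually improve forecasting accuracy depends on the dataset, so it is worth comparing against a model trained without them.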