This example demonstrates how to evaluate an XGBoost model for time series forecasting using TimeSeriesSplit cross-validation, highlighting why time-aware splitting matters when evaluating models on time series tasks.
The TimeSeriesSplit class from scikit-learn lets us evaluate our XGBoost model with walk-forward validation, where the model is repeatedly fit on past data and evaluated on the interval that immediately follows it.
We’ll use a synthetic dataset for simplicity and reproducibility.
# XGBoosting.com
# Evaluate XGBoost for Time Series Forecasting with TimeSeriesSplit
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
# Generate a synthetic univariate time series dataset
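# Seed NumPy's global RNG so the noise, and therefore the scores, are reproducible
np.random.seed(42)
series = np.sin(0.1 * np.arange(200)) + np.random.randn(200) * 0.1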
# Prepare the data for supervised learning
X, y = [], []
for i in range(10, len(series)):
    X.append(series[i-10:i])
    y.append(series[i])
X, y = np.array(X), np.array(y)
# Initialize TimeSeriesSplit for time-aware cross-validation
tscv = TimeSeriesSplit(n_splits=5)
# Initialize an XGBRegressor model
model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
# Evaluate the model using TimeSeriesSplit
mse_scores = []
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
# Print the average performance across all splits
print(f"Average Mean Squared Error: {np.mean(mse_scores):.4f}")
This example focuses on evaluating an XGBoost model for time series forecasting using TimeSeriesSplit cross-validation. Here’s a step-by-step breakdown:
- Generate a synthetic univariate time series dataset using a sine wave with added noise.
- Prepare the data for supervised learning by creating lagged features (here, we use the previous 10 time steps as features).
- Initialize TimeSeriesSplit for time-aware cross-validation with 5 splits.
- Initialize an XGBRegressor model with chosen hyperparameters.
- Evaluate the model using TimeSeriesSplit by iterating over the splits, fitting the model on the training data, making predictions on the test data, and calculating the Mean Squared Error (MSE) for each split.
- Print the average MSE across all splits to assess the model’s overall performance.
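The lagged-feature step can also be written without an explicit loop. The following is a minimal sketch rather than part of the original example: `make_lagged_features` is a hypothetical helper name, the seed is chosen only for illustration, and the code assumes NumPy 1.20 or later for `sliding_window_view`. For a given series it should produce the same X and y as the loop.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_lagged_features(series, n_lags=10):
    """Hypothetical helper: turn a 1-D series into (X, y) for supervised learning.

    Row j of X holds series[j : j + n_lags]; y[j] is series[j + n_lags].
    """
    series = np.asarray(series, dtype=float)
    if series.ndim != 1 or len(series) <= n_lags:
        raise ValueError("series must be 1-D and longer than n_lags")
    # All windows of length n_lags; the last window has no following
    # value to use as a target, so it is dropped.
    windows = sliding_window_view(series, n_lags)
    X = windows[:-1].copy()  # copy so X is a regular, writable array
    y = series[n_lags:]
    return X, y


# Same kind of synthetic series as in the example (seed is illustrative)
rng = np.random.default_rng(0)
series = np.sin(0.1 * np.arange(200)) + rng.standard_normal(200) * 0.1
X, y = make_lagged_features(series, n_lags=10)
print(X.shape, y.shape)
```

With 200 points and 10 lags this yields 190 samples, each with 10 features. Changing `n_lags` is then a one-argument change, which makes it easier to try different window lengths.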
Using TimeSeriesSplit ensures that the model is evaluated on data that comes chronologically after the training data, mimicking a real-world scenario where future data is not available during training. This helps to assess the model’s ability to generalize to new, unseen data in a time series context.
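To make the chronological ordering concrete, the short sketch below prints which rows land in the training and test sets of each fold. It is illustrative only: the placeholder array stands in for the feature matrix, and 190 rows is assumed because the example's 200-point series with 10 lags yields 190 samples.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Assumed sample count: 200 points minus 10 lags, as in the example
n_samples = 190
placeholder_X = np.zeros((n_samples, 1))  # only the number of rows matters here

tscv = TimeSeriesSplit(n_splits=5)

folds = []
for fold, (train_index, test_index) in enumerate(tscv.split(placeholder_X), start=1):
    folds.append((train_index, test_index))
    print(
        f"Fold {fold}: "
        f"train rows {train_index[0]}-{train_index[-1]} ({len(train_index)} rows), "
        f"test rows {test_index[0]}-{test_index[-1]} ({len(test_index)} rows)"
    )
```

In every fold the training rows all precede the test rows, and the training window grows from one fold to the next. One consequence is that the early folds are fit on far less data than the later ones, so per-fold scores can differ noticeably before they are averaged.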