XGBoost Time Series GridSearchCV with TimeSeriesSplit

When working with time series data, it’s crucial to perform proper cross-validation to avoid temporal data leakage.

The TimeSeriesSplit class from scikit-learn provides time series-aware cross-validation: each fold trains on earlier observations and validates on later ones, so the model is never trained on data that comes after its validation set.
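As a minimal sketch of what this looks like in practice, the snippet below prints the folds that TimeSeriesSplit produces for a toy array of 12 ordered samples. The array size and n_splits=3 are arbitrary choices for illustration, and the names X_toy and tscv_toy are not part of the main example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 12 observations in chronological order (illustrative only)
X_toy = np.arange(12).reshape(-1, 1)

tscv_toy = TimeSeriesSplit(n_splits=3)
folds = list(tscv_toy.split(X_toy))

for i, (train_idx, val_idx) in enumerate(folds):
    # Every training index is earlier than every validation index
    print(f"Fold {i}: train={train_idx.tolist()} validation={val_idx.tolist()}")
```

The training window grows from one fold to the next, while each validation window covers the observations that immediately follow it.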

This example demonstrates how to perform hyperparameter tuning for an XGBoost model using GridSearchCV and TimeSeriesSplit on a synthetic time series dataset.

import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.metrics import mean_squared_error

# Generate a synthetic time series dataset (seeded so the results are repeatable)
np.random.seed(42)
series = np.sin(0.1 * np.arange(200)) + np.random.randn(200) * 0.1

# Prepare data for supervised learning
df = pd.DataFrame(series, columns=['value'])
df['value_lag1'] = df['value'].shift(1)
df = df.dropna()

X = df[['value_lag1']].values
y = df['value'].values

# Chronological split of data into train and test sets
split_index = int(len(X) * 0.8)  # 80% of data for training
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

# Define the XGBoost model
model = XGBRegressor(random_state=42)

# Define the hyperparameter search space
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7]
}

# Create a TimeSeriesSplit object for time series cross-validation
tscv = TimeSeriesSplit(n_splits=5)

# Set up the grid search with time series cross-validation
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=tscv,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
)

# Fit the grid search object on the training data
grid_search.fit(X_train, y_train)

# Retrieve the best model and best hyperparameters
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_

# Evaluate the best model's performance on the test set
y_pred = best_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Best Hyperparameters: {best_params}")
print(f"Test Set Mean Squared Error: {mse:.4f}")

In this example:

We generate a synthetic time series dataset and prepare it for supervised learning by creating lagged features.
We define the XGBoost model and the hyperparameter search space, specifying the values to be tested for n_estimators, learning_rate, and max_depth.
We create a TimeSeriesSplit object for time series cross-validation, ensuring that each fold trains on past observations and is evaluated on later ones.
We perform grid search using GridSearchCV with the defined model, parameter grid, and TimeSeriesSplit object.
We fit the grid search object on the training data to find the best combination of hyperparameters.
We retrieve the best model, which GridSearchCV refits on the full training set by default (refit=True), and the corresponding best hyperparameters.
Finally, we evaluate the best model’s performance on the test set using Mean Squared Error (MSE) and print the results.
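The example uses a single lag, but the same shift-based approach extends to several lags. The sketch below assumes a hypothetical helper, make_lag_features, that is not part of the original example; the choice of three lags and the seed are arbitrary.

```python
import numpy as np
import pandas as pd

def make_lag_features(values, n_lags):
    """Hypothetical helper: return a frame with the target and n_lags lagged copies."""
    frame = pd.DataFrame({'value': values})
    for lag in range(1, n_lags + 1):
        frame[f'value_lag{lag}'] = frame['value'].shift(lag)
    # The first n_lags rows lack a complete history, so drop them
    return frame.dropna().reset_index(drop=True)

# Same style of synthetic series as the main example, with a fixed seed
rng = np.random.default_rng(42)
toy_series = np.sin(0.1 * np.arange(200)) + rng.normal(scale=0.1, size=200)

lagged = make_lag_features(toy_series, n_lags=3)
X_lagged = lagged.drop(columns='value').values
y_lagged = lagged['value'].values
print(X_lagged.shape, y_lagged.shape)
```

Because each feature only looks backwards in time, the resulting arrays can be passed to the same chronological split and TimeSeriesSplit workflow without introducing future information.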

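By using GridSearchCV with TimeSeriesSplit, we ensure that hyperparameter tuning respects the temporal order of the data: every candidate is scored only on observations that come after the ones it was trained on. This avoids temporal leakage, so the cross-validation scores used to select hyperparameters give a more realistic estimate of forecasting performance on unseen future data.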

See Also