When working with time series data, it’s crucial to perform proper cross-validation to avoid temporal data leakage.
The TimeSeriesSplit class from scikit-learn enables time series-aware cross-validation, ensuring that the model is not trained on future data.
This example demonstrates how to perform hyperparameter tuning for an XGBoost model using GridSearchCV and TimeSeriesSplit on a synthetic time series dataset.
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.metrics import mean_squared_error
# Generate a synthetic time series dataset (seeded so the run is reproducible)
rng = np.random.default_rng(42)
series = np.sin(0.1 * np.arange(200)) + rng.standard_normal(200) * 0.1
# Prepare data for supervised learning
df = pd.DataFrame(series, columns=['value'])
df['value_lag1'] = df['value'].shift(1)
df = df.dropna()
X = df[['value_lag1']].values
y = df['value'].values
# Chronological split of data into train and test sets
split_index = int(len(X) * 0.8) # 80% of data for training
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]
# Define the XGBoost model
model = XGBRegressor(random_state=42)
# Define the hyperparameter search space
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
}
# Create a TimeSeriesSplit object for time series cross-validation
tscv = TimeSeriesSplit(n_splits=5)
# Perform grid search
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=tscv,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
)
# Fit the grid search object on the training data
grid_search.fit(X_train, y_train)
# Retrieve the best model and best hyperparameters
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_
# Evaluate the best model's performance on the test set
y_pred = best_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Best Hyperparameters: {best_params}")
print(f"Test Set Mean Squared Error: {mse:.4f}")
In this example:
- We generate a synthetic time series dataset, prepare it for supervised learning by creating a one-step lagged feature (value_lag1), and split it chronologically into training (80%) and test (20%) sets.
- We define the XGBoost model and the hyperparameter search space, specifying the values to be tested for n_estimators, learning_rate, and max_depth.
- We create a TimeSeriesSplit object for time series cross-validation, ensuring that each fold validates the model on data that comes after the data it was trained on.
- We perform grid search using GridSearchCV with the defined model, parameter grid, and TimeSeriesSplit object.
- We fit the grid search object on the training data to find the best combination of hyperparameters.
- We retrieve the best model and the corresponding best hyperparameters.
- Finally, we evaluate the best model’s performance on the test set using Mean Squared Error (MSE) and print the results.
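
To make the fold structure concrete, here is a minimal sketch that prints the index ranges produced by TimeSeriesSplit(n_splits=5). The placeholder array X_demo and its length of 159 rows are assumptions, chosen to match the size of the training set in the example (80% of the 199 rows left after dropping the first lagged value); fold_summaries is an illustrative name, not part of any library.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder feature matrix; 159 rows is an assumption that mirrors the
# training set size in the example above.
X_demo = np.arange(159).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)

fold_summaries = []
for fold, (train_idx, val_idx) in enumerate(tscv.split(X_demo), start=1):
    # Every training index comes before every validation index.
    assert train_idx.max() < val_idx.min()
    fold_summaries.append((len(train_idx), len(val_idx)))
    print(
        f"Fold {fold}: train [{train_idx.min()}..{train_idx.max()}] "
        f"-> validate [{val_idx.min()}..{val_idx.max()}]"
    )
```

The training window grows from fold to fold while the validation window always lies after it, so the earliest folds are fit on comparatively little data.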
By using GridSearchCV with TimeSeriesSplit, we ensure that the hyperparameter tuning process respects the temporal structure of the data and avoids look-ahead leakage, which gives more trustworthy validation scores and, in turn, a more reliable model for time series forecasting.
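
As a rough illustration of the leakage that a time series-aware splitter avoids, the sketch below counts how many folds train on an observation that comes later than the earliest validation observation, comparing a shuffled KFold with TimeSeriesSplit. The helper count_leaky_folds is hypothetical and written only for this illustration, and the sample count of 159 is an assumption matching the training set above.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit


def count_leaky_folds(cv, n_samples=159):
    """Count folds whose training set contains an index later than
    the earliest validation index (hypothetical helper for illustration)."""
    X_demo = np.zeros((n_samples, 1))  # values are irrelevant, only row order matters
    leaky = 0
    for train_idx, val_idx in cv.split(X_demo):
        if train_idx.max() > val_idx.min():
            leaky += 1
    return leaky


kfold_leaks = count_leaky_folds(KFold(n_splits=5, shuffle=True, random_state=42))
tscv_leaks = count_leaky_folds(TimeSeriesSplit(n_splits=5))

print(f"Shuffled KFold folds with look-ahead: {kfold_leaks} of 5")
print(f"TimeSeriesSplit folds with look-ahead: {tscv_leaks} of 5")
```

With KFold, the last observation sits in the training set of at least four of the five folds, so those folds necessarily train on data from after their validation period. TimeSeriesSplit never does this.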