
XGBoost Assumes Stationary Time Series Data

XGBoost, like many other machine learning algorithms, assumes that time series data is stationary.

This means that the statistical properties of the data, such as the mean and variance, remain constant over time.

However, real-world time series data often exhibits nonstationary behavior, which can lead to poor model performance if not addressed.

Fortunately, several data transformation methods can be applied to make a nonstationary time series stationary.
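
Before transforming anything, it helps to confirm that the series is actually nonstationary. A common check is the Augmented Dickey-Fuller (ADF) test; below is a minimal sketch, assuming the statsmodels package (not used elsewhere in this example) is installed. A small p-value suggests the series is stationary.

import numpy as np
from statsmodels.tsa.stattools import adfuller

# Compare a trending (nonstationary) series against white noise (stationary)
rng = np.random.default_rng(42)
trending = 0.01 * np.arange(1000) + rng.normal(size=1000)
white_noise = rng.normal(size=1000)

for name, series in [("trending", trending), ("white noise", white_noise)]:
    adf_stat, p_value = adfuller(series)[:2]
    print(f"{name}: ADF statistic = {adf_stat:.3f}, p-value = {p_value:.4f}")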

This example demonstrates how to use a logarithmic transform, differencing, and scaling to preprocess a synthetic nonstationary time series for use with XGBoost.

from sklearn.datasets import make_friedman1
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd

# Generate a synthetic nonstationary time series dataset
X, y = make_friedman1(n_samples=1000, noise=0.5, random_state=42)
y = y + 0.01 * np.arange(len(y))  # Add a linear trend to make it nonstationary

# Apply a logarithmic transform to stabilize variance
# (this requires a strictly positive series; the Friedman targets here are positive)
log_y = np.log(pd.Series(y))

# Apply first order differencing to remove the trend
log_diff_y = log_y.diff().dropna()

# Scale the transformed data to have zero mean and unit variance
# (fit on the full series for simplicity; in practice, fit the scaler on the
# training portion only to avoid leaking test-set statistics)
scaler = StandardScaler()
scaled_log_diff_y = scaler.fit_transform(log_diff_y.values.reshape(-1, 1)).flatten()

# Prepare the transformed data for supervised learning with a lag of 1
df = pd.DataFrame(scaled_log_diff_y, columns=['scaled_log_diff_y'])
df['scaled_log_diff_y_lag1'] = df['scaled_log_diff_y'].shift(1)
df = df.dropna()

X = df[['scaled_log_diff_y_lag1']].values
y = df['scaled_log_diff_y'].values

# Split the data into train and test sets chronologically
split_index = int(len(X) * 0.8)  # 80% for training, 20% for testing
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

# Train an XGBRegressor on the transformed training data
model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the transformed test data
y_pred = model.predict(X_test)

# Evaluate model performance using mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")

Here’s a step-by-step breakdown of the code:

  1. Generate a synthetic nonstationary time series dataset using make_friedman1 from scikit-learn and add a linear trend to make it nonstationary.
  2. Apply a logarithmic transform using np.log() to stabilize the variance (this requires a strictly positive series).
  3. Apply first order differencing using diff() to remove the trend.
  4. Scale the transformed data to have zero mean and unit variance using StandardScaler from scikit-learn (fit on the full series here for simplicity; fit on the training portion only in practice).
  5. Prepare the transformed data for supervised learning with a lag of 1 by creating a DataFrame with the scaled log-differenced series and a lagged feature.
  6. Split the data chronologically into train and test sets (see the TimeSeriesSplit sketch after this list for a more robust alternative).
  7. Train an XGBRegressor on the transformed training data.
  8. Make predictions on the transformed test data.
  9. Evaluate the model’s performance using Mean Squared Error (MSE).
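
As mentioned in step 6, a single chronological split gives only one estimate of performance. Scikit-learn's TimeSeriesSplit produces several expanding-window folds that each respect temporal order. A minimal sketch, reusing X, y, and the imports from the example above:

from sklearn.model_selection import TimeSeriesSplit

# Expanding-window cross-validation: each fold trains on an earlier window
# and tests on the period immediately after it
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    fold_mse = mean_squared_error(y[test_idx], model.predict(X[test_idx]))
    print(f"Fold {fold}: MSE = {fold_mse:.4f}")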

This example demonstrates how to combine multiple data transformation techniques to make a nonstationary time series stationary for use with XGBoost. By applying a logarithmic transform, differencing, and scaling, we can stabilize variance, remove trend, and normalize the data, respectively.
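
Keep in mind that the MSE above is measured on the transformed scale. To express forecasts in the original units, each transformation must be inverted in reverse order: unscale, undo the differencing with a cumulative sum, then exponentiate. A minimal sketch, reusing scaler and y_pred from the example above; last_log_level is a hypothetical placeholder for the last observed log value before the forecast window:

# Invert the pipeline in reverse order: unscale -> cumulative sum -> exp
pred_log_diff = scaler.inverse_transform(y_pred.reshape(-1, 1)).flatten()

# Undo differencing: accumulate predicted log-differences onto the last
# observed log-level before the forecast window
last_log_level = 2.5  # hypothetical anchor; use np.log of the actual last observation
pred_log_levels = last_log_level + np.cumsum(pred_log_diff)

# Undo the logarithm to return to the original scale
pred_original = np.exp(pred_log_levels)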

It’s important to note that the choice of transformation methods depends on the characteristics of the time series data and the problem at hand. Always visualize your data and consider the problem context before applying any transformations.
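
For instance, plotting a series together with its rolling mean and rolling standard deviation often reveals drift or changing variance at a glance. A minimal sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt
import pandas as pd

def plot_rolling_stats(series, window=50):
    """Plot a series with its rolling mean and rolling standard deviation."""
    s = pd.Series(series)
    plt.figure(figsize=(10, 4))
    plt.plot(s, alpha=0.5, label="series")
    plt.plot(s.rolling(window).mean(), label=f"rolling mean ({window})")
    plt.plot(s.rolling(window).std(), label=f"rolling std ({window})")
    plt.legend()
    plt.show()

# A drifting rolling mean or a widening rolling std suggests nonstationarity.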
