When working with date variables in XGBoost, feature engineering can significantly improve model performance.
By extracting relevant information from dates, such as year, month, day, and the difference between dates, we can provide the model with more informative features that capture temporal patterns and relationships.
In this example, we’ll demonstrate the impact of feature engineering date variables on XGBoost’s performance using a synthetic dataset with two date columns and a target variable influenced by the dates.
We’ll compare the performance of two XGBoost models: one trained on the dates encoded as plain numerical values (days since a reference date) and another trained on feature-engineered date columns.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
# Generate a synthetic dataset with two date columns and a target variable
def generate_dataset(num_samples):
    start_date = pd.to_datetime('2020-01-01')
    end_date = pd.to_datetime('2022-12-31')
    date_range = pd.date_range(start_date, end_date, freq='D')
    date1 = np.random.choice(date_range, num_samples)
    date2 = [d + pd.Timedelta(days=np.random.randint(1, 365)) for d in date1]
    X = pd.DataFrame({'date1': date1, 'date2': date2})
    y = np.random.randn(num_samples) + (X['date2'] - X['date1']).dt.days / 100
    return X, y
# Generate the dataset
X, y = generate_dataset(1000)
# Create a copy of the dataset with feature engineered date columns
X_engineered = X.copy()
X_engineered['date1_year'] = X_engineered['date1'].dt.year
X_engineered['date1_month'] = X_engineered['date1'].dt.month
X_engineered['date1_day'] = X_engineered['date1'].dt.day
X_engineered['date2_year'] = X_engineered['date2'].dt.year
X_engineered['date2_month'] = X_engineered['date2'].dt.month
X_engineered['date2_day'] = X_engineered['date2'].dt.day
X_engineered['date_diff'] = (X_engineered['date2'] - X_engineered['date1']).dt.days
X_engineered.drop(['date1', 'date2'], axis=1, inplace=True)
# Convert date columns to numerical values (number of days since a reference date)
X['date1'] = (X['date1'] - pd.to_datetime('2020-01-01')).dt.days
X['date2'] = (X['date2'] - pd.to_datetime('2020-01-01')).dt.days
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_engineered, X_test_engineered, _, _ = train_test_split(X_engineered, y, test_size=0.2, random_state=42)
# Train an XGBoost model on the original dataset
model_original = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
model_original.fit(X_train, y_train)
# Train an XGBoost model on the feature engineered dataset
model_engineered = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
model_engineered.fit(X_train_engineered, y_train)
# Evaluate the models' performance using RMSE
y_pred_original = model_original.predict(X_test)
y_pred_engineered = model_engineered.predict(X_test_engineered)
rmse_original = np.sqrt(mean_squared_error(y_test, y_pred_original))
rmse_engineered = np.sqrt(mean_squared_error(y_test, y_pred_engineered))
print(f"RMSE (Original): {rmse_original:.4f}")
print(f"RMSE (Feature Engineered): {rmse_engineered:.4f}")
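Because the data is generated without a fixed random seed, your numbers will differ from run to run, but you may see results that look like the following: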
RMSE (Original): 1.1185
RMSE (Feature Engineered): 1.0792
In this example, we generate a synthetic dataset using the generate_dataset function, which creates two date columns (date1 and date2) and a target variable influenced by the difference between the dates. We then create two versions of the dataset: one with dates as numerical values (number of days since a reference date) and another with feature-engineered date columns (year, month, day, and date difference).
Next, we train two XGBoost regressors with identical hyperparameters, one on each dataset. We evaluate both models with the Root Mean Squared Error (RMSE), the square root of the mean squared residual. It is expressed in the same units as the target and penalizes large errors more heavily than small ones.
Finally, we print the RMSE scores of both models to compare their performance.
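To make the metric concrete, here is a small sketch of RMSE computed by hand with NumPy. The arrays y_true and y_pred hold made-up illustration values, not outputs from the models above.

```python
import numpy as np

# Hypothetical values for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 8.0])

residuals = y_true - y_pred        # per-sample prediction errors
mse = np.mean(residuals ** 2)      # mean of the squared errors
rmse = np.sqrt(mse)                # back in the units of the target
mae = np.mean(np.abs(residuals))   # mean absolute error, for comparison

print(f"RMSE: {rmse:.4f}")  # RMSE: 2.0000
print(f"MAE:  {mae:.4f}")   # MAE:  1.0000
```

A single large miss is enough to pull RMSE well above the mean absolute error, which is why RMSE is sensitive to outlying predictions.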
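The impact of feature engineering date variables on XGBoost’s performance varies with the dataset and the problem. In this synthetic example, the target depends only on the difference between the two dates, so the engineered date_diff column hands the model that signal directly, whereas the model trained on raw day counts has to approximate the difference through splits on date1 and date2 separately. The year, month, and day columns carry no additional signal here. Because the target also includes standard normal noise, neither model can be expected to reach an RMSE much below 1.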
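One way to put the reported numbers in context is to estimate the best achievable error. In the generator above, the target is standard normal noise plus the date difference divided by 100, so even a predictor that knows the exact formula is left with the noise. The sketch below regenerates similar data and scores that ideal predictor. It uses a seeded NumPy Generator for reproducibility, which is an assumption of this sketch and differs from the unseeded calls in the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
num_samples = 1000

# Same date range and offsets as the example, drawn with the seeded generator
date_range = pd.date_range('2020-01-01', '2022-12-31', freq='D')
date1 = pd.Series(date_range[rng.integers(0, len(date_range), num_samples)])
offsets = rng.integers(1, 365, num_samples)  # 1 to 364 days
date2 = date1 + pd.to_timedelta(offsets, unit='D')

date_diff = (date2 - date1).dt.days
noise = rng.standard_normal(num_samples)
y = noise + date_diff / 100

# A predictor that knows the generating formula exactly
ideal_pred = date_diff / 100
floor_rmse = np.sqrt(np.mean((y - ideal_pred) ** 2))
print(f"RMSE of the ideal predictor: {floor_rmse:.4f}")
```

The printed value should land close to 1. Both RMSE values shown earlier sit only slightly above that level, so the gap between the two models is small relative to the noise. Repeating the comparison over several seeds or with cross-validation would give a more reliable picture.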
When working with real-world datasets containing date variables, it’s essential to experiment with different feature engineering techniques and evaluate their impact on the model’s performance to determine the best approach for your specific use case.
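As a starting point for such experiments, the sketch below adds a few date features that the example above does not use: day of week, a weekend flag, and a sine/cosine encoding of the month. The helper name add_calendar_features and the tiny DataFrame are hypothetical, and whether any of these features help depends on the data.

```python
import numpy as np
import pandas as pd

def add_calendar_features(df, col):
    """Return a copy of df with extra calendar features derived from df[col].

    Hypothetical helper for illustration; not part of XGBoost or pandas.
    """
    out = df.copy()
    dates = out[col]
    out[f'{col}_dayofweek'] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    out[f'{col}_is_weekend'] = (dates.dt.dayofweek >= 5).astype(int)
    # Cyclical encoding places December next to January on a circle
    angle = 2 * np.pi * (dates.dt.month - 1) / 12
    out[f'{col}_month_sin'] = np.sin(angle)
    out[f'{col}_month_cos'] = np.cos(angle)
    return out

# Tiny made-up frame: a Saturday, a Monday, and a Friday
df = pd.DataFrame({'date1': pd.to_datetime(['2021-01-02', '2021-07-05', '2021-12-31'])})
features = add_calendar_features(df, 'date1').drop(columns=['date1'])
print(features)
```

Tree-based models can split on the raw month value directly, so the cyclical encoding may matter less for XGBoost than it would for a linear model. Comparing both variants on a held-out set is the way to find out.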