The Root Mean Squared Logarithmic Error (RMSLE) is an evaluation metric commonly used for regression tasks where the target values are non-negative and follow an exponential growth pattern, such as population counts, sales prices, or website traffic.
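Because it compares logarithms of the predicted and actual values, RMSLE measures relative rather than absolute error. This makes it less sensitive than RMSE to large errors on large targets, and it penalizes underestimation more heavily than overestimation of the same absolute size.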
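To make the asymmetry concrete, here is a minimal sketch that computes RMSLE as sqrt(mean((log(1 + prediction) - log(1 + actual))^2)) and compares it with RMSE. The helper names `rmsle` and `rmse` and the sample numbers are illustrative assumptions, not part of XGBoost.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

def rmse(y_true, y_pred):
    """Root mean squared error, shown for comparison."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

actual = [100.0]            # illustrative target value
under_prediction = [50.0]   # 50 units too low
over_prediction = [150.0]   # 50 units too high

rmsle_under = rmsle(actual, under_prediction)
rmsle_over = rmsle(actual, over_prediction)
rmse_under = rmse(actual, under_prediction)
rmse_over = rmse(actual, over_prediction)

print(f"RMSLE under: {rmsle_under:.4f}, over: {rmsle_over:.4f}")
print(f"RMSE  under: {rmse_under:.4f}, over: {rmse_over:.4f}")
```

Both predictions miss by the same absolute amount, so RMSE scores them identically, while RMSLE assigns the larger error to the underestimate.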
By setting eval_metric='rmsle' in XGBoost, you can monitor your model’s performance using RMSLE during training and enable early stopping to prevent overfitting. Here’s an example of how to use RMSLE as the evaluation metric with XGBoost and scikit-learn:
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
import matplotlib.pyplot as plt
# Generate a synthetic regression dataset with non-negative target values
np.random.seed(42)
X = np.random.rand(1000, 10)
y = np.exp(np.random.rand(1000))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an XGBRegressor with RMSLE as the evaluation metric
model = XGBRegressor(n_estimators=100, eval_metric='rmsle', early_stopping_rounds=10, random_state=42)
# Train the model with early stopping
model.fit(X_train, y_train, eval_set=[(X_test, y_test)])
# Retrieve the RMSLE values from the training process
results = model.evals_result()
epochs = len(results['validation_0']['rmsle'])
x_axis = range(0, epochs)
# Plot the RMSLE values
plt.figure()
plt.plot(x_axis, results['validation_0']['rmsle'], label='Test')
plt.legend()
plt.xlabel('Number of Boosting Rounds')
plt.ylabel('RMSLE')
plt.title('XGBoost RMSLE Performance')
plt.show()
# Make predictions and evaluate the final RMSLE on the test set
y_pred = model.predict(X_test)
final_rmsle = np.sqrt(np.mean(np.square(np.log1p(y_pred) - np.log1p(y_test))))
print(f"Final RMSLE on test set: {final_rmsle:.4f}")
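In this example, we generate a synthetic regression dataset using NumPy. The targets are the exponential of uniform random values, so they are strictly positive and lie between 1 and e. Because they are drawn independently of the features, the example demonstrates the API rather than real predictive performance. We then split the data into training and testing sets.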
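We create an instance of XGBRegressor and set eval_metric='rmsle' to specify RMSLE as the evaluation metric. This changes only the metric that is reported and used for early stopping; the training objective remains the default squared error. We also set early_stopping_rounds=10 to stop training if the RMSLE doesn’t improve for 10 consecutive rounds.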
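The patience rule behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration under the assumption that lower scores are better. It is not XGBoost's actual implementation, and the function name `early_stopping_round` and the score list are hypothetical.

```python
def early_stopping_round(scores, patience):
    """Return (stop_round, best_round) for a lower-is-better metric.

    Training stops at the first round where `patience` rounds have
    passed without beating the best score seen so far.
    """
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(scores):
        if score < best_score:
            # New best: remember it and reset the patience window.
            best_score = score
            best_round = round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return round_idx, best_round
    # The metric never stalled long enough, so all rounds were used.
    return len(scores) - 1, best_round

# Hypothetical per-round validation RMSLE values (made up for illustration).
hypothetical_rmsle = [0.30, 0.25, 0.22, 0.21, 0.215, 0.22, 0.23, 0.24]
stop_round, best_round = early_stopping_round(hypothetical_rmsle, patience=3)
print(f"Stopped at round {stop_round}, best round was {best_round}")
```

With a patience of 3, the sketch stops three rounds after the best score, which mirrors how a larger early_stopping_rounds value tolerates longer plateaus before giving up.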
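During training, we pass the testing set as the eval_set to monitor the model’s performance on held-out data. After training, we retrieve the per-round RMSLE values using the evals_result() method. Note that because the same test set also drives early stopping, the final score reported on it is somewhat optimistic; in practice, a separate validation split is preferable for eval_set.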
We plot the RMSLE values against the number of boosting rounds to visualize the model’s performance during training. This plot helps us see where the validation error stops improving and choose a suitable number of boosting rounds. Plotting the training curve alongside it would also make overfitting easier to spot.
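As a rough sketch of how such curves can be read numerically, the snippet below uses a hand-written dictionary in the nested layout that evals_result() returns when two evaluation sets are passed (keys 'validation_0' and 'validation_1'). The numbers are invented for illustration and do not come from the model above.

```python
import numpy as np

# Hypothetical history, assuming eval_set=[(X_train, y_train), (X_val, y_val)].
# All values are made up to show a typical overfitting pattern.
history = {
    "validation_0": {"rmsle": [0.30, 0.24, 0.20, 0.17, 0.15, 0.13]},  # training set
    "validation_1": {"rmsle": [0.31, 0.26, 0.23, 0.22, 0.23, 0.25]},  # held-out set
}

train_curve = np.array(history["validation_0"]["rmsle"])
val_curve = np.array(history["validation_1"]["rmsle"])

# Round with the lowest held-out RMSLE (0-indexed).
best_round = int(np.argmin(val_curve))

# A widening gap between held-out and training error suggests overfitting.
gap = val_curve - train_curve

print(f"Best round: {best_round}")
print(f"Train/validation gap per round: {np.round(gap, 3)}")
```

In this made-up history the training error keeps falling while the held-out error turns upward after the best round, which is the pattern early stopping is meant to catch.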
Finally, we make predictions on the test set and evaluate the final RMSLE to measure the model’s performance on unseen data.
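One practical caveat: a model trained with a squared-error objective can in principle produce negative predictions, and log1p is undefined at or below -1. The following is a minimal sketch of a guarded RMSLE that clips predictions at zero before taking logs. The function name `safe_rmsle` and the choice to clip at zero are assumptions made for illustration, not XGBoost behavior.

```python
import numpy as np

def safe_rmsle(y_true, y_pred):
    """RMSLE that clips negative predictions to zero before taking logs.

    Clipping is an illustrative choice; negative targets are rejected
    because RMSLE is only meaningful for non-negative values.
    """
    y_true = np.asarray(y_true, dtype=float)
    if np.any(y_true < 0):
        raise ValueError("RMSLE requires non-negative target values")
    # Replace negative predictions with 0 so log1p stays finite.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, None)
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Illustrative values: the second prediction is negative and gets clipped to 0.
example_true = [1.0, 3.0]
example_pred = [1.0, -0.5]
score = safe_rmsle(example_true, example_pred)
print(f"Guarded RMSLE: {score:.4f}")
```

For the dataset in this example the targets lie between 1 and e, so predictions are unlikely to go negative and the guard mainly matters for targets close to zero.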
By using RMSLE as the evaluation metric, we can effectively monitor the model’s performance on non-negative, exponentially growing target values, prevent overfitting through early stopping, and select the best model based on the lowest RMSLE value.