When working with regression problems where the target variable follows a Tweedie distribution, using the negative log-likelihood of the Tweedie distribution as the evaluation metric can lead to better model performance.
The Tweedie distribution is a family of probability distributions that includes the Poisson, Gamma, and inverse Gaussian distributions, making it suitable for modeling non-negative, right-skewed data with varying degrees of dispersion.
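The family is indexed by a variance power p that links the variance to the mean via Var(Y) = phi * mu^p: p = 1 gives Poisson, p = 2 gives Gamma, p = 3 gives inverse Gaussian, and 1 < p < 2 gives the compound Poisson-Gamma distributions often used for zero-inflated, right-skewed data such as insurance claims. As a minimal illustration of this relationship (the helper function below is our own, not part of any library):
def tweedie_variance(mu, phi, p):
    # Variance of a Tweedie variable with mean mu, dispersion phi, and power p:
    # Var(Y) = phi * mu**p (p=1 Poisson-like, p=2 Gamma, p=3 inverse Gaussian)
    return phi * mu ** p

print(tweedie_variance(10.0, 1.0, 1.5))  # variance grows as mu**1.5, here ~31.62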
XGBoost provides the "tweedie-nloglik" evaluation metric, which is specifically designed for such problems. To use this metric, set the objective parameter to "reg:tweedie" and specify an appropriate tweedie_variance_power value based on your data’s characteristics.
Here’s an example of how to use "tweedie-nloglik" as the evaluation metric with XGBoost and scikit-learn:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
import numpy as np
import matplotlib.pyplot as plt
# Generate a synthetic regression dataset with Tweedie-distributed targets
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
y = np.abs(y) # Ensure non-negative targets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an XGBRegressor with "tweedie-nloglik" as the evaluation metric
model = XGBRegressor(objective="reg:tweedie", tweedie_variance_power=1.5,
n_estimators=100, eval_metric="tweedie-nloglik@1.5",
early_stopping_rounds=10, random_state=42)
# Train the model with early stopping
model.fit(X_train, y_train, eval_set=[(X_test, y_test)])
# Retrieve the "tweedie-nloglik" values from the training process
results = model.evals_result()
epochs = len(results['validation_0']['tweedie-nloglik@1.5'])
x_axis = range(0, epochs)
# Plot the "tweedie-nloglik" values
plt.figure()
plt.plot(x_axis, results['validation_0']['tweedie-nloglik@1.5'], label='Test')
plt.legend()
plt.xlabel('Number of Boosting Rounds')
plt.ylabel('Tweedie Negative Log-Likelihood')
plt.title('XGBoost Tweedie Negative Log-Likelihood Performance')
plt.show()
In this example, we generate a synthetic regression dataset using scikit-learn’s make_regression function and take the absolute value of the targets, since the Tweedie distribution requires non-negative values.
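Note that taking the absolute value of make_regression targets only approximates the shape of Tweedie data. If you want targets that genuinely follow a compound Poisson-Gamma law (1 < p < 2), one option is to simulate them directly; the rate and shape parameters below are illustrative assumptions:
rng = np.random.default_rng(42)
counts = rng.poisson(lam=1.5, size=1000)  # Poisson event counts (some are exactly zero)
y_tweedie = np.array([
    rng.gamma(2.0, 1.0, size=k).sum()  # sum of Gamma-distributed severities per sample
    for k in counts
])
# y_tweedie is non-negative and right-skewed, with a point mass at zero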
We create an instance of XGBRegressor with objective="reg:tweedie" and eval_metric="tweedie-nloglik@1.5". The tweedie_variance_power is set to 1.5, which corresponds to the compound Poisson-Gamma distribution. The optimal value for this parameter depends on your data and can be determined through cross-validation or domain knowledge, as sketched below.
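One simple way to search for tweedie_variance_power is to fit a model per candidate value and compare predictions on held-out data with a common metric, so that scores are comparable across powers. A rough sketch using mean absolute error (other hold-out metrics would work equally well):
from sklearn.metrics import mean_absolute_error

best_power, best_mae = None, float("inf")
for power in [1.1, 1.3, 1.5, 1.7, 1.9]:
    candidate = XGBRegressor(objective="reg:tweedie", tweedie_variance_power=power,
                             n_estimators=100, random_state=42)
    candidate.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, candidate.predict(X_test))
    if mae < best_mae:
        best_power, best_mae = power, mae
print(f"Best variance power: {best_power} (MAE: {best_mae:.4f})")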
The "tweedie-nloglik"
metric must be configured for a specific tweedie variance power. This can be achieved by setting the metric to "tweedie-nloglik"
followed by an "@"
and the power value, for example: "tweedie-nloglik@1.5"
.
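The same "metric@power" string works with XGBoost’s native training API; a minimal sketch reusing the split from the example above:
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
params = {
    "objective": "reg:tweedie",
    "tweedie_variance_power": 1.5,
    "eval_metric": "tweedie-nloglik@1.5",
}
# Prints the test-set Tweedie negative log-likelihood after each boosting round
booster = xgb.train(params, dtrain, num_boost_round=100, evals=[(dtest, "test")])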
During training, we use early stopping based on the "tweedie-nloglik" metric to prevent overfitting. The eval_set parameter is used to monitor the model’s performance on the test set.
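In recent XGBoost versions (1.6 and later, where early_stopping_rounds is a constructor argument as used here), the round and score chosen by early stopping are exposed on the fitted estimator:
# Best boosting round and its "tweedie-nloglik@1.5" value found by early stopping
print(f"Best iteration: {model.best_iteration}")
print(f"Best score: {model.best_score:.4f}")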
After training, we retrieve the "tweedie-nloglik" values using the evals_result() method and plot them against the number of boosting rounds. The plot helps us assess the model’s performance and select the optimal number of rounds based on the minimum "tweedie-nloglik" value.
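Reading the optimal round off the stored history is a one-liner, and should agree with model.best_iteration reported by early stopping:
nloglik = results['validation_0']['tweedie-nloglik@1.5']
best_round = int(np.argmin(nloglik))
print(f"Minimum tweedie-nloglik at round {best_round}: {nloglik[best_round]:.4f}")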
By using "tweedie-nloglik"
as the evaluation metric, we can effectively optimize our XGBoost model for Tweedie-distributed targets, leading to better predictions and model performance.