When working with regression problems where the target variable follows a Tweedie distribution, using the negative log-likelihood of the Tweedie distribution as the evaluation metric can lead to better model performance.
The Tweedie distribution is a family of probability distributions that includes the Poisson, Gamma, and inverse Gaussian distributions, making it suitable for modeling non-negative, right-skewed data with varying degrees of dispersion.
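The family is indexed by a variance power p that links the variance to the mean via Var(Y) = phi * mu^p: p = 1 gives Poisson, p = 2 gives Gamma, p = 3 gives inverse Gaussian, and 1 < p < 2 gives the compound Poisson-Gamma distributions often used for zero-inflated, right-skewed data such as insurance claims. As a minimal illustration of this relationship (the helper function below is our own, not part of any library):
def tweedie_variance(mu, phi, p):
    # Variance of a Tweedie variable with mean mu, dispersion phi, and power p:
    # Var(Y) = phi * mu**p (p=1 Poisson-like, p=2 Gamma, p=3 inverse Gaussian)
    return phi * mu ** p

print(tweedie_variance(10.0, 1.0, 1.5))  # variance grows as mu**1.5, here ~31.62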
XGBoost provides the "tweedie-nloglik" evaluation metric, which is specifically designed for such problems. To use this metric, set the objective parameter to "reg:tweedie" and specify an appropriate tweedie_variance_power value based on your data’s characteristics.
Here’s an example of how to use "tweedie-nloglik" as the evaluation metric with XGBoost and scikit-learn:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
import numpy as np
import matplotlib.pyplot as plt
# Generate a synthetic regression dataset with Tweedie-distributed targets
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
y = np.abs(y) # Ensure non-negative targets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an XGBRegressor with "tweedie-nloglik" as the evaluation metric
model = XGBRegressor(objective="reg:tweedie", tweedie_variance_power=1.5,
n_estimators=100, eval_metric="tweedie-nloglik@1.5",
early_stopping_rounds=10, random_state=42)
# Train the model with early stopping
model.fit(X_train, y_train, eval_set=[(X_test, y_test)])
# Retrieve the "tweedie-nloglik" values from the training process
results = model.evals_result()
epochs = len(results['validation_0']['tweedie-nloglik@1.5'])
x_axis = range(0, epochs)
# Plot the "tweedie-nloglik" values
plt.figure()
plt.plot(x_axis, results['validation_0']['tweedie-nloglik@1.5'], label='Test')
plt.legend()
plt.xlabel('Number of Boosting Rounds')
plt.ylabel('Tweedie Negative Log-Likelihood')
plt.title('XGBoost Tweedie Negative Log-Likelihood Performance')
plt.show()
In this example, we generate a synthetic regression dataset using scikit-learn’s make_regression function and take the absolute value of the targets, since the Tweedie distribution requires non-negative values.
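Note that taking the absolute value of make_regression targets only approximates the shape of Tweedie data. If you want targets that genuinely follow a compound Poisson-Gamma law (1 < p < 2), one option is to simulate them directly; the rate and shape parameters below are illustrative assumptions:
rng = np.random.default_rng(42)
counts = rng.poisson(lam=1.5, size=1000)  # Poisson event counts (some are exactly zero)
y_tweedie = np.array([
    rng.gamma(2.0, 1.0, size=k).sum()  # sum of Gamma-distributed severities per sample
    for k in counts
])
# y_tweedie is non-negative and right-skewed, with a point mass at zero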
We create an instance of XGBRegressor with objective="reg:tweedie" and eval_metric="tweedie-nloglik@1.5". The tweedie_variance_power is set to 1.5, which corresponds to the compound Poisson-Gamma distribution. The optimal value for this parameter depends on your data and can be determined through cross-validation or domain knowledge, as sketched below.
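One simple way to search for tweedie_variance_power is to fit a model per candidate value and compare predictions on held-out data with a common metric, so that scores are comparable across powers. A rough sketch using mean absolute error (other hold-out metrics would work equally well):
from sklearn.metrics import mean_absolute_error

best_power, best_mae = None, float("inf")
for power in [1.1, 1.3, 1.5, 1.7, 1.9]:
    candidate = XGBRegressor(objective="reg:tweedie", tweedie_variance_power=power,
                             n_estimators=100, random_state=42)
    candidate.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, candidate.predict(X_test))
    if mae < best_mae:
        best_power, best_mae = power, mae
print(f"Best variance power: {best_power} (MAE: {best_mae:.4f})")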
The "tweedie-nloglik"
metric must be configured for a specific tweedie variance power. This can be achieved by setting the metric to "tweedie-nloglik"
followed by an "@"
and the power value, for example: "tweedie-nloglik@1.5"
.
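The same "metric@power" string works with XGBoost’s native training API; a minimal sketch reusing the split from the example above:
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
params = {
    "objective": "reg:tweedie",
    "tweedie_variance_power": 1.5,
    "eval_metric": "tweedie-nloglik@1.5",
}
# Prints the test-set Tweedie negative log-likelihood after each boosting round
booster = xgb.train(params, dtrain, num_boost_round=100, evals=[(dtest, "test")])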
During training, we use early stopping based on the "tweedie-nloglik" metric to prevent overfitting. The eval_set parameter is used to monitor the model’s performance on the test set.
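In recent XGBoost versions (1.6 and later, where early_stopping_rounds is a constructor argument as used here), the round and score chosen by early stopping are exposed on the fitted estimator:
# Best boosting round and its "tweedie-nloglik@1.5" value found by early stopping
print(f"Best iteration: {model.best_iteration}")
print(f"Best score: {model.best_score:.4f}")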
After training, we retrieve the "tweedie-nloglik" values using the evals_result() method and plot them against the number of boosting rounds. The plot helps us assess the model’s performance and select the optimal number of rounds based on the minimum "tweedie-nloglik" value.
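Reading the optimal round off the stored history is a one-liner, and should agree with model.best_iteration reported by early stopping:
nloglik = results['validation_0']['tweedie-nloglik@1.5']
best_round = int(np.argmin(nloglik))
print(f"Minimum tweedie-nloglik at round {best_round}: {nloglik[best_round]:.4f}")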
By using "tweedie-nloglik"
as the evaluation metric, we can effectively optimize our XGBoost model for Tweedie-distributed targets, leading to better predictions and model performance.