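When training an XGBoost model on targets that follow a gamma distribution, evaluating with the negative log-likelihood of the gamma distribution (“gamma-nloglik”) measures fit in terms that match the target’s distribution, which can make validation scores and early stopping more informative than a generic error metric.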
This metric is particularly useful for tasks such as insurance claim severity modeling or customer lifetime value prediction, where the target variable is typically non-negative and right-skewed.
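To make the metric concrete, here is a minimal NumPy sketch of a gamma negative log-likelihood with the shape parameter fixed at 1 and the distribution parameterized by its mean. The function name gamma_nloglik is hypothetical, and the unit-shape simplification is an assumption; XGBoost’s built-in metric may differ in constants or numerical safeguards.

```python
import numpy as np

def gamma_nloglik(y_true, y_pred, eps=1e-6):
    """Mean gamma negative log-likelihood with shape fixed at 1 (an assumption).

    With shape 1 and mean mu, the per-sample value is y / mu + log(mu).
    """
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from zero so the log and the division stay finite
    mu = np.maximum(np.asarray(y_pred, dtype=float), eps)
    return float(np.mean(y_true / mu + np.log(mu)))

y = np.array([1.0, 2.0, 4.0])
print(gamma_nloglik(y, y))        # predicting each target exactly
print(gamma_nloglik(y, 2.0 * y))  # overshooting by 2x scores worse (higher)
```

Lower is better: for a fixed target y, the per-sample value is smallest when the prediction equals y.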
By setting eval_metric='gamma-nloglik', you can monitor your model’s performance during training and enable early stopping to prevent overfitting. Here’s an example of how to use “gamma-nloglik” as the evaluation metric with XGBoost and scikit-learn:
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
import matplotlib.pyplot as plt
# Seed NumPy's global generator so the synthetic data is reproducible
np.random.seed(42)
# Generate a synthetic gamma-distributed target variable
shape, scale = 2, 2
y = np.random.gamma(shape, scale, 1000)
# Create a random feature matrix
X = np.random.rand(1000, 10)
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an XGBRegressor with "gamma-nloglik" as the evaluation metric
model = XGBRegressor(n_estimators=100, eval_metric='gamma-nloglik', early_stopping_rounds=10, random_state=42)
# Train the model with early stopping
model.fit(X_train, y_train, eval_set=[(X_test, y_test)])
# Retrieve the "gamma-nloglik" values from the training process
results = model.evals_result()
epochs = len(results['validation_0']['gamma-nloglik'])
x_axis = range(0, epochs)
# Plot the "gamma-nloglik" values
plt.figure()
plt.plot(x_axis, results['validation_0']['gamma-nloglik'], label='Test')
plt.legend()
plt.xlabel('Number of Boosting Rounds')
plt.ylabel('Gamma Negative Log-Likelihood')
plt.title('XGBoost Gamma Negative Log-Likelihood Performance')
plt.show()
In this example, we generate a synthetic gamma-distributed target variable using NumPy’s random.gamma function. We then create a random feature matrix and split the data into training and testing sets.
We create an instance of XGBRegressor and set eval_metric='gamma-nloglik' to specify the gamma negative log-likelihood as the evaluation metric. We also set early_stopping_rounds=10 to enable early stopping if the metric doesn’t improve for 10 consecutive rounds.
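The early stopping rule described above can be sketched in plain Python. This illustrates the patience logic only and is not XGBoost’s internal implementation; the function name and the sample metric values are made up for the example.

```python
def early_stopping_round(metric_values, patience=10):
    """Return (best_round, stop_round) for a metric where lower is better.

    best_round: index of the lowest value seen before stopping.
    stop_round: index at which training would halt, or the last index
                if the patience limit is never reached.
    """
    best_round, best_value = 0, float("inf")
    for i, value in enumerate(metric_values):
        if value < best_value:
            best_round, best_value = i, value
        elif i - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here
            return best_round, i
    return best_round, len(metric_values) - 1

# Hypothetical validation scores: improve for 4 rounds, then drift upward
scores = [2.10, 2.05, 2.02, 2.01] + [2.03 + 0.01 * k for k in range(12)]
best, stopped = early_stopping_round(scores, patience=10)
print(best, stopped)  # prints: 3 13
```

Here the best score occurs at round 3, and training would halt at round 13 after ten rounds without improvement.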
During training, we pass the testing set as the eval_set to monitor the model’s performance on unseen data. After training, we retrieve the “gamma-nloglik” values using the evals_result() method.
Finally, we plot the “gamma-nloglik” values against the number of boosting rounds to visualize the model’s performance during training. This plot helps us assess whether the model is overfitting or underfitting and determine the optimal number of boosting rounds.
By using “gamma-nloglik” as the evaluation metric, we can effectively train XGBoost models on gamma-distributed targets, prevent overfitting through early stopping, and select the best model based on the lowest negative log-likelihood value.