
Check if XGBoost Is Underfitting

Underfitting occurs when a model is too simple to capture the underlying patterns in the data.

It leads to poor performance on both the training data and unseen data.

Detecting underfitting is crucial for building effective XGBoost models that can make accurate predictions. Key techniques for detecting underfitting in XGBoost include comparing training and validation performance, analyzing learning curves, and examining feature importances.

Here’s a code snippet that demonstrates how to compare training and validation error to detect underfitting:

from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression

# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost regressor with a small number of estimators and a high learning rate
xgb_reg = XGBRegressor(n_estimators=10, learning_rate=0.5, random_state=42)
xgb_reg.fit(X_train, y_train)

# Predict on training and validation sets
train_preds = xgb_reg.predict(X_train)
val_preds = xgb_reg.predict(X_val)

# Calculate MSE for training and validation sets
train_mse = mean_squared_error(y_train, train_preds)
val_mse = mean_squared_error(y_val, val_preds)

print(f"Training MSE: {train_mse:.4f}")
print(f"Validation MSE: {val_mse:.4f}")

# Check if both training and validation MSE are high
# (the 100 threshold is specific to this synthetic dataset, not a universal cutoff)
if train_mse > 100 and val_mse > 100:
    print("Warning: The model may be underfitting!")
    print("Consider increasing model complexity by adding more estimators, reducing learning rate, or adjusting other hyperparameters.")

In this code, we:

  1. Generate a synthetic regression dataset using make_regression from scikit-learn.
  2. Split the data into training and validation sets using train_test_split.
  3. Train an XGBoost regressor with deliberately few estimators (10) so the model lacks the capacity to fit the data well, simulating underfitting.
  4. Make predictions on both the training and validation sets.
  5. Calculate the mean squared error (MSE) for the training and validation sets using mean_squared_error from scikit-learn.
  6. Print the training and validation MSE.
  7. Check if both training and validation MSE are high (here, greater than 100, a threshold chosen for this synthetic dataset rather than a universal cutoff). If they are, print a warning about potential underfitting and suggest increasing model capacity.

High MSE values for both training and validation sets indicate that the model is too simple and fails to capture the underlying patterns in the data.
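
To make the suggested remedy concrete, here is a minimal follow-up sketch (reusing X_train, y_train, X_val, and y_val from above; the hyperparameter values and the xgb_bigger name are illustrative, not tuned) that retrains with more capacity and checks whether the error drops:

# Retrain with more boosting rounds and a moderate learning rate
# (illustrative values; tune these for your own data)
xgb_bigger = XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1, random_state=42)
xgb_bigger.fit(X_train, y_train)

bigger_train_mse = mean_squared_error(y_train, xgb_bigger.predict(X_train))
bigger_val_mse = mean_squared_error(y_val, xgb_bigger.predict(X_val))

print(f"Training MSE with more capacity: {bigger_train_mse:.4f}")
print(f"Validation MSE with more capacity: {bigger_val_mse:.4f}")

If both errors fall well below the earlier values, the original model was indeed underfitting; if they stay high, the limitation may lie in the features rather than the model.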

Other methods for detecting underfitting, both sketched below, include:

  * Analyzing learning curves: if the training and validation errors both plateau at a high value as boosting rounds increase, the model likely lacks the capacity to learn the data.
  * Examining feature importances: an underfit model may assign little importance to most features, a hint that it has not picked up the structure the data offers.
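
For learning curves, one lightweight option (a sketch; xgb_curve and the other names are illustrative) is to pass an eval_set to fit and read the per-round metric history back with evals_result:

# Record the default regression metric (RMSE) on both sets at every boosting round
xgb_curve = XGBRegressor(n_estimators=10, learning_rate=0.5, random_state=42)
xgb_curve.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_val, y_val)], verbose=False)

history = xgb_curve.evals_result()
train_rmse = history["validation_0"]["rmse"]  # training-set curve
val_rmse = history["validation_1"]["rmse"]    # validation-set curve

# Both curves flattening out at a high error suggests the model lacks capacity
print(f"Final training RMSE: {train_rmse[-1]:.4f}")
print(f"Final validation RMSE: {val_rmse[-1]:.4f}")

For feature importances, the fitted model from the main example exposes feature_importances_. If most values sit at or near zero, the model may not be extracting much signal; this is a rough heuristic rather than a definitive test, and the 0.01 cutoff below is arbitrary:

import numpy as np

# Count features the underfit model barely uses
importances = xgb_reg.feature_importances_
near_zero = int(np.sum(importances < 0.01))  # arbitrary illustrative cutoff
print(f"Features with near-zero importance: {near_zero} of {len(importances)}")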

By regularly checking for underfitting with these techniques, you can identify when your XGBoost model is too simple and increase its capacity accordingly, for example by adding more estimators, growing deeper trees, or relaxing regularization, improving the model’s ability to learn from the data and make accurate predictions.


