The early_stopping_rounds parameter in XGBoost allows for early termination of training if the model's performance on a validation set does not improve for a specified number of rounds. This helps prevent overfitting and saves computational resources by stopping training once the model's performance plateaus.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=10, random_state=42)
# Split the dataset into training, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
# Initialize the XGBoost classifier with early_stopping_rounds
model = XGBClassifier(n_estimators=1000, early_stopping_rounds=10, eval_metric='logloss')
# Fit the model with early stopping
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=True)
# Report the best score and the iteration at which it occurred
print(f'Best score {model.best_score}, Best iteration {model.best_iteration}')
# Make predictions using the model at the best iteration
predictions = model.predict(X_test)
The best-performing model, as measured on the validation set, is retained. Its score is available via the best_score property, and the round at which it occurred via the best_iteration property. Making predictions with the fitted model uses the model stored at the best_iteration.
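In recent XGBoost releases, predict() uses the model at best_iteration automatically when early stopping has fired. If you want this to be explicit, the iteration_range argument to predict() restricts prediction to a given slice of boosting rounds. A minimal sketch, reusing model and X_test from the example above:
# Explicitly predict with only the trees up to and including best_iteration.
# iteration_range is a half-open interval [start, end).
predictions = model.predict(X_test, iteration_range=(0, model.best_iteration + 1))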
Understanding the “early_stopping_rounds” Parameter
The early_stopping_rounds parameter specifies the number of rounds (iterations) to continue training after the last improvement in the model's performance on the validation set. Early stopping helps prevent overfitting by terminating training when performance on unseen data (the validation set) stops improving, and it saves computational resources by avoiding unnecessary iterations.
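The same mechanism is available in XGBoost's native training API, where early_stopping_rounds is passed to xgboost.train() directly. A minimal sketch that reuses the X_train/X_val splits from the example above:
import xgboost as xgb
# Wrap the splits in DMatrix objects for the native API
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
# Training stops if 'val' logloss fails to improve for 10 consecutive rounds
booster = xgb.train(
    {'objective': 'binary:logistic', 'eval_metric': 'logloss'},
    dtrain,
    num_boost_round=1000,
    evals=[(dval, 'val')],
    early_stopping_rounds=10,
)
print(f'Best iteration: {booster.best_iteration}')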
Configuring “early_stopping_rounds”
When configuring early_stopping_rounds, there is a trade-off between allowing the model to continue learning and preventing overfitting. A smaller value may stop training too early, before the model has fully learned from the data; a larger value may allow the model to overfit by continuing to train well past its optimal performance on the validation set.
The optimal value for early_stopping_rounds depends on the size and complexity of the dataset and model. As a general guideline, monitor the model's performance on the validation set during training to determine an appropriate value. If performance plateaus for a considerable number of rounds, the model has likely reached its optimal point and further training could lead to overfitting.
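One way to do this monitoring after the fact is to inspect the per-round validation metric that fit() records. A minimal sketch, assuming the model trained in the example above, using evals_result() to see where the logloss curve flattens:
# Retrieve the validation logloss recorded at each boosting round
history = model.evals_result()
val_logloss = history['validation_0']['logloss']
# Find the round with the best (lowest) logloss
best_round = val_logloss.index(min(val_logloss))
print(f'Lowest logloss {min(val_logloss):.5f} at round {best_round}')
print(f'Rounds trained after the best round: {len(val_logloss) - best_round - 1}')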
Practical Tips
- Use a separate validation set for early stopping to avoid biasing the model’s performance. This validation set should be distinct from the training and test sets.
- Set n_estimators to a sufficiently large value in combination with early_stopping_rounds to allow the model to reach its full potential while still preventing overfitting.
- Set verbose=True when fitting the model to monitor the training process and observe when early stopping occurs.
- Experiment with different values of early_stopping_rounds and compare the model's performance on a held-out test set to find the optimal configuration for your specific problem (a sketch follows this list).
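As a concrete version of the last tip, the sketch below refits the classifier with a few candidate early_stopping_rounds values and compares accuracy on the held-out test set. It reuses the splits and XGBClassifier import from the example above; the candidate values are illustrative, not a recommendation:
from sklearn.metrics import accuracy_score
# Compare candidate patience values on the held-out test set
for rounds in [5, 10, 50]:
    candidate = XGBClassifier(n_estimators=1000, early_stopping_rounds=rounds, eval_metric='logloss')
    candidate.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    accuracy = accuracy_score(y_test, candidate.predict(X_test))
    print(f'early_stopping_rounds={rounds}: '
          f'best_iteration={candidate.best_iteration}, test accuracy={accuracy:.3f}')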