
XGBoost Incremental Round Ablation via "iteration_range"

Training an XGBoost model for a large number of boosting rounds and then using a validation set to select the optimal number of rounds helps prevent overfitting and identifies the best-performing model.

This example demonstrates how to use the iteration_range parameter to evaluate the model with different numbers of training rounds and select the round count that yields the best validation accuracy.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Generate synthetic multi-class classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8, n_redundant=2,
                           n_classes=3, weights=[0.5, 0.3, 0.2], random_state=42)

# Split data into train, validation, and test sets (64% train / 16% validation / 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Initialize XGBoost classifier and train for a large number of rounds
model = XGBClassifier(n_estimators=500, random_state=42)
model.fit(X_train, y_train)

# Evaluate the model on the validation set with different numbers of rounds
num_rounds_list = [50, 100, 200, 300, 400, 500]
val_accuracies = []

for num_rounds in num_rounds_list:
    y_val_pred = model.predict(X_val, iteration_range=(0, num_rounds))
    val_accuracy = accuracy_score(y_val, y_val_pred)
    val_accuracies.append(val_accuracy)
    print(f"Rounds: {num_rounds}, Validation Accuracy: {val_accuracy:.4f}")

# Select the optimal number of rounds
best_num_rounds = num_rounds_list[val_accuracies.index(max(val_accuracies))]
print(f"\nBest number of rounds: {best_num_rounds}")

# Retrain the final model with the optimal number of rounds
final_model = XGBClassifier(n_estimators=best_num_rounds, random_state=42)
final_model.fit(X_train, y_train)

# Evaluate the final model on the test set
y_test_pred = final_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Test Accuracy: {test_accuracy:.4f}")

The key steps:

  1. Generate a synthetic multi-class classification dataset using make_classification from scikit-learn.

  2. Split the data into training, validation, and test sets.

  3. Initialize an XGBoost classifier and train it for a large number of rounds (e.g., 500).

  4. Evaluate the model on the validation set with different numbers of rounds using the iteration_range parameter. Store the validation accuracies for each number of rounds.

  5. Select the number of rounds that gives the best validation accuracy.

  6. Retrain the final model with the optimal number of rounds.

  7. Evaluate the final model on the test set and print the test accuracy.

By training the model for a large number of rounds and then selecting the best number of rounds based on validation accuracy, you can identify the optimal point at which the model generalizes well without overfitting. The iteration_range parameter allows you to efficiently evaluate the model’s performance at different stages of training without retraining from scratch.
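Because boosting is additive, you can even skip the retraining step entirely: with the deterministic settings used here, the first best_num_rounds trees of the 500-round model are the same trees a fresh best_num_rounds-round model would learn. A minimal sketch, reusing the model and best_num_rounds variables from the example above:

# Predict with only the first best_num_rounds trees of the 500-round model,
# avoiding a separate retraining run
y_test_pred_capped = model.predict(X_test, iteration_range=(0, best_num_rounds))
print(f"Test Accuracy (capped at {best_num_rounds} rounds): {accuracy_score(y_test, y_test_pred_capped):.4f}")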

Note that the specific values used for the number of rounds and the dataset parameters can be adjusted based on your specific requirements and dataset characteristics.
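For instance, a finer-grained sweep costs only extra predictions, not extra training. A quick sketch that evaluates the already-trained model every 25 rounds (the step size of 25 is an arbitrary choice for illustration):

# Scan validation accuracy at every 25 rounds of the trained model
for num_rounds in range(25, 501, 25):
    y_val_pred = model.predict(X_val, iteration_range=(0, num_rounds))
    print(f"Rounds: {num_rounds}, Validation Accuracy: {accuracy_score(y_val, y_val_pred):.4f}")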


