
Tune "num_boost_round" Parameter to xgboost.train()

The num_boost_round parameter in XGBoost’s native API (xgboost.train) controls the number of boosting rounds, that is, the number of trees built by the algorithm.

Tuning this parameter can significantly impact the model’s performance.

Setting it too low may result in underfitting, while setting it too high may lead to overfitting and increased training time.

Tune “num_boost_round” Manually

This example demonstrates how to manually tune the num_boost_round parameter by evaluating the model’s performance on a validation set for a range of values.

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create a synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, n_features=20, n_informative=10, random_state=42)

# Split data into train, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# Define XGBoost parameters
params = {
    'objective': 'binary:logistic',
    'learning_rate': 0.1,
    'max_depth': 3,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'seed': 42
}

# Create DMatrices for train, validation, and test sets
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
dtest = xgb.DMatrix(X_test, label=y_test)

# Define the range of num_boost_round values to evaluate
num_round_values = [50, 100, 150, 200, 250, 300]

# Initialize variables to store the best results
best_num_round = None
best_accuracy = 0

# Iterate over the num_boost_round values
for num_round in num_round_values:
    # Train the model with the current num_boost_round value
    model = xgb.train(params, dtrain, num_boost_round=num_round)

    # Make predictions on the validation set
    y_pred_val = model.predict(dval)
    y_pred_val = (y_pred_val > 0.5).astype(int)

    # Calculate the validation accuracy
    val_accuracy = accuracy_score(y_val, y_pred_val)

    # Update the best results if the current model is better
    if val_accuracy > best_accuracy:
        best_num_round = num_round
        best_accuracy = val_accuracy

    # Report progress
    print(f'>{num_round}: {val_accuracy}')

# Train the final model with the best num_boost_round value
best_model = xgb.train(params, dtrain, num_boost_round=best_num_round)

# Evaluate the final model on the test set
y_pred_test = best_model.predict(dtest)
y_pred_test = (y_pred_test > 0.5).astype(int)
test_accuracy = accuracy_score(y_test, y_pred_test)

print(f"Best num_boost_round: {best_num_round}")
print(f"Validation Accuracy: {best_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

In this example, we create a synthetic binary classification dataset and split it into training, validation, and test sets.

We define the XGBoost model parameters in the params dictionary and create DMatrix objects for the training, validation, and test sets.

We specify a range of num_boost_round values to evaluate in the num_round_values list. We initialize variables best_num_round and best_accuracy to keep track of the best results.

We iterate over the num_round_values and train an XGBoost model for each value using xgb.train. We make predictions on the validation set, calculate the validation accuracy, and update the best results if the current model outperforms the previous best.

After the loop, we train the final model using the best num_boost_round value found during the manual tuning process. We evaluate the final model’s performance on the test set by making predictions and calculating the test accuracy.

Finally, we print the best num_boost_round value, the corresponding validation accuracy, and the test accuracy of the final model.

Manual tuning allows us to explicitly control the range of values to evaluate and observe the model’s performance for each value.

However, it can be time-consuming, especially if the range of values is large. In practice, using early stopping or more advanced hyperparameter optimization techniques like grid search or Bayesian optimization can be more efficient.
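For example, with the scikit-learn wrapper (XGBClassifier), the number of boosting rounds is exposed as the n_estimators parameter, so it can be searched with GridSearchCV. The snippet below is a minimal sketch of that approach, reusing X_train and y_train from the example above; the parameter grid and cross-validation setting are illustrative choices only.

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Grid search over the number of boosting rounds; n_estimators in the
# scikit-learn wrapper plays the role of num_boost_round in the native API
grid = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, max_depth=3, subsample=0.8,
                            colsample_bytree=0.8, random_state=42),
    param_grid={'n_estimators': [50, 100, 150, 200, 250, 300]},
    scoring='accuracy',
    cv=3
)
grid.fit(X_train, y_train)
print(f"Best n_estimators: {grid.best_params_['n_estimators']}")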

Tune “num_boost_round” via Early Stopping

This example demonstrates how to use early stopping to find the optimal num_boost_round value.

Early stopping monitors the model’s performance on a validation set during training and stops the training process if the performance does not improve after a specified number of rounds.

import xgboost as xgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, n_features=20, n_informative=10, random_state=42)

# Split data into train, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# Define XGBoost parameters
params = {
    'objective': 'binary:logistic',
    'learning_rate': 0.1,
    'max_depth': 3,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'seed': 42
}

# Create DMatrices for train, validation, and test sets
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
dtest = xgb.DMatrix(X_test, label=y_test)

# Define early stopping callback
early_stop = xgb.callback.EarlyStopping(rounds=10, metric_name='logloss', data_name='val', save_best=True)

# Train the model with early stopping
model = xgb.train(params, dtrain, num_boost_round=1000, evals=[(dval, 'val')], callbacks=[early_stop])

# Get the best number of boosting rounds found by early stopping
# (best_iteration is a zero-based index, so add 1 to get a round count)
best_num_boost_round = model.best_iteration + 1
print(f"Best num_boost_round: {best_num_boost_round}")

# Evaluate the model on the test set
y_pred = model.predict(dtest)
accuracy = np.mean(y_test == (y_pred > 0.5))
print(f"Test Accuracy: {accuracy:.4f}")

In this example, we create a synthetic binary classification dataset using scikit-learn’s make_classification function and split it into training, validation, and test sets.

We define the XGBoost model parameters in a dictionary called params. Then, we create DMatrix objects for the training, validation, and test sets, which are required by the xgboost.train function.

We define an early stopping callback using xgb.callback.EarlyStopping, specifying the number of rounds to wait for improvement (rounds), the evaluation metric to monitor (metric_name), the name of the validation set (data_name), and whether to save the best model (save_best).

We train the XGBoost model using xgb.train, passing the parameters, training data, maximum number of boosting rounds, evaluation sets, and the early stopping callback. The training process stops early if the validation log loss does not improve for 10 consecutive rounds.
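As an aside, xgb.train also accepts an early_stopping_rounds argument that achieves a similar effect without constructing a callback explicitly. The line below is a minimal sketch of that variant, assuming the same params, dtrain, and dval objects defined above.

# Equivalent early stopping without an explicit callback: training stops if
# the validation metric does not improve for 10 consecutive rounds
model = xgb.train(params, dtrain, num_boost_round=1000,
                  evals=[(dval, 'val')], early_stopping_rounds=10)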

After training, we retrieve the best number of boosting rounds found by early stopping from model.best_iteration (a zero-based index, hence the + 1 in the code above). Finally, we evaluate the model’s performance on the test set by making predictions and calculating the accuracy.
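Because save_best=True, the returned booster should already correspond to the best iteration. If the full booster were kept instead, prediction could be limited to the best rounds explicitly; the line below is a sketch of that pattern, assuming an XGBoost version that supports iteration_range (1.4 or later).

# Restrict prediction to the rounds up to and including the best iteration
# (best_iteration is a zero-based index, hence the + 1)
y_pred = model.predict(dtest, iteration_range=(0, model.best_iteration + 1))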

By using early stopping, we can automatically find the optimal num_boost_round value that balances the model’s performance and training time, avoiding overfitting and saving computational resources.
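Once the best number of rounds is known, a common follow-up (not shown above) is to retrain on the combined training and validation data with that fixed budget, so that no data is left unused. The snippet below is a rough sketch of that idea, reusing the arrays and params from the example; whether it helps depends on the dataset.

# Optional follow-up: retrain on train + validation data with the number of
# rounds selected by early stopping (best_iteration is zero-based, hence + 1)
dtrain_full = xgb.DMatrix(np.vstack([X_train, X_val]),
                          label=np.concatenate([y_train, y_val]))
final_model = xgb.train(params, dtrain_full, num_boost_round=model.best_iteration + 1)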


