Tune XGBoost "grow_policy" Parameter

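The grow_policy parameter in XGBoost controls the order in which new nodes are added to a tree during training. With 'depthwise' (the default), XGBoost splits the nodes closest to the root first; with 'lossguide', it splits the node with the highest loss change first. The parameter is only supported when tree_method is 'hist' or 'approx'.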

It can have a significant impact on the model’s performance and training speed. By tuning the grow_policy, you can find the best setting for your specific problem and dataset.

This example demonstrates how to compare different grow_policy settings using a synthetic multiclass classification dataset.

import xgboost as xgb
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create a synthetic dataset
X, y = make_classification(n_samples=10000, n_classes=5, n_features=20, n_informative=10, random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the grow_policy settings to compare
grow_policies = ['depthwise', 'lossguide']

# Train and evaluate models with different grow_policy settings
for policy in grow_policies:
    # Set up XGBoost classifier; grow_policy is only supported by the
    # 'hist' and 'approx' tree methods, so request 'hist' explicitly
    model = xgb.XGBClassifier(n_estimators=100, tree_method='hist', grow_policy=policy, random_state=42)

    # Train the model, timing only the call to fit()
    start_time = time.perf_counter()
    model.fit(X_train, y_train)
    training_time = time.perf_counter() - start_time

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)

    print(f"grow_policy: {policy}")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Training Time: {training_time:.2f} seconds")
    print()

In this example, we create a synthetic multiclass classification dataset using scikit-learn’s make_classification function with 10,000 samples, 5 classes, and 20 features (10 of them informative). We hold out 20% of the data as a test set using train_test_split.

We define a list grow_policies that contains the two possible settings for the grow_policy parameter: 'depthwise' and 'lossguide'.

We iterate over the grow_policies list and train an XGBoost classifier for each setting. We fix n_estimators=100 so that both models build the same number of boosting rounds, and set random_state=42 for reproducibility. The grow_policy parameter is set to the policy being evaluated.

For each model, we measure the training time using the time module. We train the model using model.fit(X_train, y_train) and make predictions on the test set using model.predict(X_test). We calculate the accuracy of the predictions using accuracy_score from scikit-learn.

Finally, we print the grow_policy setting, the achieved accuracy, and the training time for each model.

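By comparing the results, you can assess the impact of the grow_policy setting on the model’s accuracy and training time. In the run reported here, the 'lossguide' policy achieved slightly higher accuracy but took marginally longer to train than the 'depthwise' policy; the exact numbers will vary with your XGBoost version and hardware. Note that with the default max_depth and no limit on max_leaves, the two policies tend to build very similar trees, so differences usually become more visible when 'lossguide' is combined with a max_leaves limit.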

Keep in mind that the optimal grow_policy setting may depend on your specific dataset and problem. It’s recommended to experiment with both settings and evaluate their performance on your particular use case.
