The grow_policy parameter in XGBoost determines the strategy used for growing trees during the training process. It can have a significant impact on the model’s performance and training speed. By tuning grow_policy, you can find the best setting for your specific problem and dataset.
This example demonstrates how to compare different grow_policy settings using a synthetic multiclass classification dataset.
import time

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Create a synthetic dataset
X, y = make_classification(n_samples=10000, n_classes=5, n_features=20, n_informative=10, random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the grow_policy settings to compare
grow_policies = ['depthwise', 'lossguide']

# Train and evaluate models with different grow_policy settings
for policy in grow_policies:
    # Set up XGBoost classifier.
    # tree_method='hist' is set explicitly because grow_policy is only
    # honored by the histogram-based tree methods.
    model = xgb.XGBClassifier(n_estimators=100, tree_method='hist', grow_policy=policy, random_state=42)

    # Train the model, timing only the call to fit()
    start_time = time.perf_counter()
    model.fit(X_train, y_train)
    training_time = time.perf_counter() - start_time

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)

    print(f"grow_policy: {policy}")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Training Time: {training_time:.2f} seconds")
    print()
In this example, we create a synthetic multiclass classification dataset using scikit-learn’s make_classification function with 10,000 samples, 5 classes, and 20 features, 10 of which are informative. We split the data into training and testing sets using train_test_split, holding out 20% of the samples for testing.
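We define a list grow_policies that contains the two possible settings for the grow_policy parameter: 'depthwise', which splits the nodes closest to the root first, and 'lossguide', which splits the node with the highest loss change first.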
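The two policies are usually paired with different tree-size controls, which the example above leaves at their defaults. The sketch below is not part of the original example. It assumes the hist tree method and uses XGBoost's max_depth and max_leaves parameters. The helper name policy_kwargs is hypothetical, and the values 6 and 31 are illustrative assumptions rather than recommendations.

```python
# Hypothetical helper: constructor keyword arguments for each grow_policy.
# The values 6 and 31 are illustrative assumptions, not tuned recommendations.

def policy_kwargs(policy, max_depth=6, max_leaves=31):
    """Return keyword arguments for a given grow_policy setting."""
    if policy == "depthwise":
        # Level-by-level growth: tree size is bounded by depth.
        return {
            "tree_method": "hist",
            "grow_policy": "depthwise",
            "max_depth": max_depth,
        }
    if policy == "lossguide":
        # Best-first growth: bound the number of leaves and lift the
        # depth limit (max_depth=0 means no depth limit in XGBoost).
        return {
            "tree_method": "hist",
            "grow_policy": "lossguide",
            "max_depth": 0,
            "max_leaves": max_leaves,
        }
    raise ValueError(f"unknown grow_policy: {policy!r}")


for policy in ("depthwise", "lossguide"):
    print(policy, policy_kwargs(policy))
```

The returned dictionary can be unpacked into the classifier, for example xgb.XGBClassifier(n_estimators=100, **policy_kwargs("lossguide")). When both policies share the same depth limit and no leaf limit is set, they can end up growing very similar trees, so the size controls matter when comparing them.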
We iterate over the grow_policies list and train an XGBoost classifier for each setting. We set n_estimators=100 to train 100 trees and random_state=42 for reproducibility. The grow_policy parameter is set to the current policy being evaluated.
For each model, we measure the training time using the time module. We train the model using model.fit(X_train, y_train) and make predictions on the test set using model.predict(X_test). We calculate the accuracy of the predictions using accuracy_score from scikit-learn.
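When comparing training speed, it helps to time fitting and prediction separately so that one does not leak into the other. The following is a minimal sketch of one way to do that, not the method used in the original example. The helper timed_fit_predict and the stand-in MajorityClassModel are hypothetical names introduced here so the block runs without XGBoost installed.

```python
import time


def timed_fit_predict(model, X_train, y_train, X_test):
    """Fit and predict, timing each phase separately.

    Works with any estimator that exposes fit() and predict().
    Returns (predictions, fit_seconds, predict_seconds).
    """
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    predict_seconds = time.perf_counter() - start
    return predictions, fit_seconds, predict_seconds


class MajorityClassModel:
    """Tiny stand-in estimator that always predicts the most common label."""

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


preds, fit_s, pred_s = timed_fit_predict(
    MajorityClassModel(), [[0], [1], [2]], [1, 1, 0], [[3], [4]]
)
print(preds, f"fit={fit_s:.6f}s predict={pred_s:.6f}s")
```

Any scikit-learn style estimator can be passed in place of the stand-in, including an xgb.XGBClassifier configured with a given grow_policy.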
Finally, we print the grow_policy setting, the achieved accuracy, and the training time for each model.
By comparing the results, you can assess the impact of the grow_policy setting on the model’s performance and training time. In this example, the 'lossguide' policy achieved a slightly higher accuracy but took marginally longer to train than the 'depthwise' policy. Both differences were small, and timings in particular vary between machines and XGBoost versions.
Keep in mind that the optimal grow_policy setting may depend on your specific dataset and problem. Experiment with both settings and evaluate their performance on your particular use case.