The sample_type parameter in XGBoost's DART booster controls how dropped trees are selected during training. There are two options for this parameter:
- 'uniform' (default): dropped trees are selected uniformly at random, so each tree has an equal probability of being dropped.
- 'weighted': dropped trees are selected in proportion to their weights, so trees with higher weights are more likely to be dropped.
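If you use XGBoost's native API rather than the scikit-learn wrapper, sample_type is passed in the params dict alongside the other DART settings. Here is a minimal sketch, assuming the native xgboost.train() interface and an arbitrary toy dataset:
import numpy as np
import xgboost as xgb
# Toy data purely for illustration
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
dtrain = xgb.DMatrix(X, label=y)
# DART parameters, including sample_type, go in the params dict
params = {
    'booster': 'dart',
    'sample_type': 'weighted',  # or 'uniform' (the default)
    'rate_drop': 0.1,
    'skip_drop': 0.5,
    'objective': 'binary:logistic',
}
booster = xgb.train(params, dtrain, num_boost_round=50)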
The choice of sample_type can impact the model's performance and generalization ability. Let's demonstrate this using a synthetic multiclass classification dataset:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
# Generate a synthetic multiclass classification dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=3,
                           n_redundant=1, n_features=5, random_state=42)
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize two XGBClassifier models with different sample_type settings
model_uniform = XGBClassifier(booster='dart', sample_type='uniform',
                              rate_drop=0.1, skip_drop=0.5, random_state=42)
model_weighted = XGBClassifier(booster='dart', sample_type='weighted',
                               rate_drop=0.1, skip_drop=0.5, random_state=42)
# Train the models
model_uniform.fit(X_train, y_train)
model_weighted.fit(X_train, y_train)
# Make predictions on test set
pred_uniform = model_uniform.predict(X_test)
pred_weighted = model_weighted.predict(X_test)
# Calculate accuracy scores
acc_uniform = accuracy_score(y_test, pred_uniform)
acc_weighted = accuracy_score(y_test, pred_weighted)
print(f"Accuracy (sample_type='uniform'): {acc_uniform:.4f}")
print(f"Accuracy (sample_type='weighted'): {acc_weighted:.4f}")
In this example, we generate a synthetic multiclass classification dataset using scikit-learn's make_classification() function and split the data into training and test sets.
Next, we initialize two XGBClassifier models with the DART booster, setting sample_type='uniform' for one model and sample_type='weighted' for the other, keeping all other hyperparameters identical.
We train both models on the same training data using the fit() method, make predictions on the test set using predict(), and calculate the accuracy scores using scikit-learn's accuracy_score() function.
Finally, we print the accuracy scores for both models to compare their performance.
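Keep in mind that a single train/test split can be noisy. As an optional extension (not part of the original example), you could average accuracy over cross-validation folds, reusing the models and data defined above:
from sklearn.model_selection import cross_val_score
# Sketch: 5-fold cross-validated accuracy for each sample_type setting,
# assuming model_uniform, model_weighted, X, and y from the example above
cv_uniform = cross_val_score(model_uniform, X, y, cv=5, scoring='accuracy')
cv_weighted = cross_val_score(model_weighted, X, y, cv=5, scoring='accuracy')
print(f"CV accuracy (sample_type='uniform'): {cv_uniform.mean():.4f} +/- {cv_uniform.std():.4f}")
print(f"CV accuracy (sample_type='weighted'): {cv_weighted.mean():.4f} +/- {cv_weighted.std():.4f}")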
By running this example and comparing the accuracy scores, you can see how the choice of sample_type affects the model's performance on this specific dataset. Experiment with different datasets and hyperparameter settings to gain a better understanding of when to use 'uniform' or 'weighted' sampling in your XGBoost DART models.
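One simple way to run such an experiment is a small loop over candidate settings. This is a hypothetical sketch (the grid values are arbitrary), assuming the imports and train/test split from the example above:
from itertools import product
# Compare both sample_type settings across a few rate_drop/skip_drop values
for st, rd, sd in product(['uniform', 'weighted'], [0.1, 0.3], [0.25, 0.5]):
    model = XGBClassifier(booster='dart', sample_type=st, rate_drop=rd,
                          skip_drop=sd, random_state=42)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"sample_type={st!r}, rate_drop={rd}, skip_drop={sd}: acc={acc:.4f}")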