The subsample parameter in XGBoost controls the fraction of observations used to build each tree. By adjusting subsample, you can influence the model's performance and its ability to generalize. The example below sets subsample on a synthetic binary classification task:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=10, random_state=42)
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the XGBoost classifier with a subsample value
model = XGBClassifier(subsample=0.8, eval_metric='logloss')
# Fit the model
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
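As a quick sanity check (not part of the original snippet), you can score these predictions with scikit-learn's accuracy_score:
from sklearn.metrics import accuracy_score
# Evaluate the predictions on the held-out test set
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.3f}")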
Understanding the “subsample” Parameter
The subsample parameter determines the fraction of observations to be randomly sampled for each tree during training. It is a regularization technique that can help prevent overfitting by introducing randomness into the training data. subsample accepts values in the range (0, 1], where 1 means that all observations are used for each tree. The default value of subsample in XGBoost is 1.
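Conceptually, each boosting round draws a fresh random subset of rows before growing its tree. The following is a minimal sketch of that mechanic using NumPy directly (it mimics the idea, not XGBoost's actual internals), reusing X_train and y_train from the snippet above:
import numpy as np

rng = np.random.default_rng(42)
subsample = 0.8  # fraction of rows used per tree
n_rows = X_train.shape[0]

# For each boosting round, sample a fraction of row indices without replacement
for round_idx in range(3):  # three rounds shown for illustration
    sampled = rng.choice(n_rows, size=int(subsample * n_rows), replace=False)
    X_round, y_round = X_train[sampled], y_train[sampled]
    # ... a tree would be grown on (X_round, y_round) in this round
    print(f"Round {round_idx}: tree sees {len(sampled)} of {n_rows} rows")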
Choosing the Right “subsample” Value
The value of subsample affects the model's performance and its propensity to overfit:
- Lower subsample values introduce more randomness into the training process by using only a fraction of the observations for each tree. This can help prevent overfitting by reducing the model's reliance on specific observations. However, it may also slow down the learning process, as the model sees fewer examples per tree.
- Higher subsample values use more observations per tree, potentially improving the model's performance by allowing it to learn from a larger portion of the data. However, setting subsample too high can increase the risk of overfitting, as the model may start to memorize noise in the training data.
When setting subsample, consider the trade-off between model performance and overfitting (the sketch after this list illustrates it on the synthetic dataset):
- A lower value can reduce overfitting but may require more trees to achieve the same level of performance.
- A higher value can lead to faster learning but may overfit if set too high.
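To see this trade-off concretely, you can train the classifier at a few subsample values and compare train and test log loss. This is a minimal sketch reusing the X_train/X_test split from above; the grid of values is arbitrary, and the exact numbers will vary with the data and other hyperparameters:
from sklearn.metrics import log_loss

# Compare train/test log loss across a few subsample values
for value in [0.3, 0.5, 0.8, 1.0]:
    clf = XGBClassifier(subsample=value, eval_metric='logloss', random_state=42)
    clf.fit(X_train, y_train)
    train_ll = log_loss(y_train, clf.predict_proba(X_train))
    test_ll = log_loss(y_test, clf.predict_proba(X_test))
    print(f"subsample={value}: train logloss={train_ll:.3f}, test logloss={test_ll:.3f}")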
Practical Tips
- Start with the default subsample value (1) and adjust it based on the model's performance on a validation set.
- Use cross-validation to find the optimal subsample value that strikes a balance between model performance and overfitting (see the sketch after this list).
- Keep in mind that subsample interacts with other parameters, such as colsample_bytree (which controls the fraction of features used per tree) and the number of trees in the model.
- Monitor your model's performance on a separate validation set to detect signs of overfitting (high training performance, low validation performance).
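One way to put the cross-validation tip into practice is scikit-learn's GridSearchCV. The sketch below searches over subsample together with the interacting colsample_bytree parameter; the grid values are chosen arbitrarily for illustration:
from sklearn.model_selection import GridSearchCV

# Search over subsample and colsample_bytree via 5-fold cross-validation
param_grid = {
    'subsample': [0.5, 0.7, 0.8, 1.0],
    'colsample_bytree': [0.5, 0.8, 1.0],  # fraction of features per tree
}
search = GridSearchCV(
    XGBClassifier(eval_metric='logloss', random_state=42),
    param_grid,
    scoring='neg_log_loss',
    cv=5,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best CV log loss:", -search.best_score_)
The best values depend heavily on the dataset, so re-check the tuned model on the held-out test set rather than trusting the cross-validation score alone.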