XGBoost’s built-in regularization techniques make it well-suited for small datasets.
Regularization helps prevent overfitting, a common issue when training models on limited data.
By adjusting regularization parameters, users can find the right balance between model complexity and generalization.
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize XGBoost classifier with regularization parameters
model = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=1,
    random_state=42,
)
# Train the model
model.fit(X_train, y_train)
# Evaluate performance on test set
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
In this example, we use the Iris dataset, a small, well-known classification dataset, and split it 80/20 into train and test sets. We then initialize an XGBoost classifier with several regularization parameters:
- max_depth and min_child_weight control the complexity of individual trees
- gamma specifies the minimum loss reduction required to make a split
- subsample and colsample_bytree introduce randomness by sampling observations and features, respectively
- reg_alpha and reg_lambda are the L1 and L2 regularization terms, respectively
By tuning these parameters, you can control the model’s complexity and prevent overfitting. The optimal values for these parameters will depend on your specific dataset and problem.
After training the model, we evaluate its performance on the test set to get an estimate of how well it generalizes to unseen data.
Note that while this example demonstrates the use of regularization parameters, the specific values used here may not be optimal for your dataset. It’s recommended to use techniques like grid search or randomized search to find the best combination of regularization parameters for your specific problem.