Improving the accuracy of your XGBoost models is essential for achieving better predictions.
Here are 7 powerful techniques you can use:
Hyperparameter Tuning
Fine-tuning hyperparameters can significantly improve model accuracy. Key hyperparameters include:
- n_estimators: Number of boosting rounds. More trees can improve accuracy but may lead to overfitting.
- learning_rate: Controls the contribution of each tree. Lower values can improve performance but require more boosting rounds.
- max_depth: Maximum depth of a tree. Deeper trees can capture more complex patterns but can overfit.
- min_child_weight: Minimum sum of instance weight needed in a child. Higher values can prevent overfitting.
- subsample: Fraction of samples used for training each tree. Lower values can prevent overfitting.
- colsample_bytree: Fraction of features used for each tree. Reducing this can prevent overfitting.
Use grid search or random search for hyperparameter tuning; the example below uses grid search, and a random-search sketch follows it.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}
# Initialize XGBoost model
model = xgb.XGBClassifier()
# Perform grid search (this grid has 729 combinations, so it can take a while; n_jobs=-1 uses all CPU cores)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=3, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)
# Output best parameters
print(grid_search.best_params_)
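If the full grid is too slow, random search samples a fixed number of parameter combinations instead of trying them all. Below is a minimal sketch using scikit-learn's RandomizedSearchCV on the same data; the choice of n_iter=20 is arbitrary and only for illustration.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, RandomizedSearchCV
# Load and split dataset
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
# Candidate values to sample from
param_distributions = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}
# Random search: evaluate only 20 sampled combinations (n_iter is an arbitrary choice)
random_search = RandomizedSearchCV(estimator=xgb.XGBClassifier(), param_distributions=param_distributions,
                                   n_iter=20, scoring='accuracy', cv=3, random_state=42, verbose=1)
random_search.fit(X_train, y_train)
print(random_search.best_params_)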
Feature Engineering
Creating new features or transforming existing ones can enhance model performance. Techniques include:
- Interaction Features: Combining two or more features.
- Polynomial Features: Raising features to a power.
- Encoding Categorical Variables: One-hot encoding or target encoding (see the encoding sketch after the code below).
Use libraries like pandas and sklearn for feature engineering.
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Interaction Features
X_train['new_feature'] = X_train['mean radius'] * X_train['mean texture']
X_test['new_feature'] = X_test['mean radius'] * X_test['mean texture']
# Polynomial Features
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
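The breast cancer dataset contains only numeric features, so the snippet below is a hedged sketch on a small made-up DataFrame (the tumor_site and outcome columns are invented for illustration) showing one-hot encoding with pandas.get_dummies and a simple target (mean) encoding.
import pandas as pd
# Hypothetical data: tumor_site and outcome are invented columns for illustration
df = pd.DataFrame({
    'tumor_site': ['left', 'right', 'left', 'bilateral'],
    'mean_radius': [14.2, 20.1, 12.8, 18.5],
    'outcome': [0, 1, 0, 1]
})
# One-hot encoding: one binary column per category, so the model receives purely numeric input
df_onehot = pd.get_dummies(df, columns=['tumor_site'])
# Simple target (mean) encoding: replace each category with the mean outcome for that category
# (in practice, compute the means on training data only and add smoothing to avoid leakage)
site_means = df.groupby('tumor_site')['outcome'].mean()
df['tumor_site_te'] = df['tumor_site'].map(site_means)
print(df_onehot)
print(df[['tumor_site', 'tumor_site_te']])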
Handling Class Imbalance
For imbalanced datasets, consider techniques such as:
- Resampling: Oversampling the minority class or undersampling the majority class.
- Class Weights: Using the scale_pos_weight parameter in XGBoost.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Option 1: Resampling - oversample the minority class with SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
model_resampled = xgb.XGBClassifier()
model_resampled.fit(X_train_resampled, y_train_resampled)
# Option 2: Class weights - scale_pos_weight is the ratio of negative to positive training samples
# (use either resampling or class weights; applying both would over-correct the imbalance)
ratio = len(y_train[y_train == 0]) / len(y_train[y_train == 1])
model_weighted = xgb.XGBClassifier(scale_pos_weight=ratio)
model_weighted.fit(X_train, y_train)
# Evaluate both models on the held-out test set
print(f"SMOTE accuracy: {(model_resampled.predict(X_test) == y_test).mean()}")
print(f"Class-weight accuracy: {(model_weighted.predict(X_test) == y_test).mean()}")
Early Stopping
Using early stopping can prevent overfitting by halting training when performance on a validation set stops improving.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with early stopping: allow plenty of rounds and stop when the validation score
# has not improved for 10 rounds. In recent XGBoost versions (>= 1.6), early_stopping_rounds
# is set on the estimator; older versions accepted it in fit().
model = xgb.XGBClassifier(n_estimators=500, early_stopping_rounds=10)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=True)
# Evaluate model
preds = model.predict(X_valid)
accuracy = (preds == y_valid).mean()
print(f"Accuracy: {accuracy}")
Ensembling
Combining multiple models can improve performance through techniques like bagging, boosting, or stacking.
- Bagging: Training multiple XGBoost models on different subsets of the data (see the bagging sketch after the stacking example below).
- Stacking: Training a meta-model on the predictions of several base models.
import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define base models
model1 = xgb.XGBClassifier()
model2 = xgb.XGBClassifier(n_estimators=200, learning_rate=0.05)
# Define stacking model
stacking_model = StackingClassifier(estimators=[('xgb1', model1), ('xgb2', model2)], final_estimator=LogisticRegression())
# Train stacking model
stacking_model.fit(X_train, y_train)
# Evaluate model
preds = stacking_model.predict(X_test)
accuracy = (preds == y_test).mean()
print(f"Accuracy: {accuracy}")
Regularization
Applying regularization can help prevent overfitting:
- alpha (reg_alpha): L1 regularization term on weights.
- lambda (reg_lambda): L2 regularization term on weights.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with regularization
model = xgb.XGBClassifier(reg_alpha=0.1, reg_lambda=1.0)
model.fit(X_train, y_train)
# Evaluate model
preds = model.predict(X_test)
accuracy = (preds == y_test).mean()
print(f"Accuracy: {accuracy}")
Learning Rate Scheduling
Adjusting the learning rate during training can lead to better convergence.
This can be achieved using a learning rate schedule.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import accuracy_score
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=42)
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the learning rate schedule: decay the base rate by 5% every 10 boosting rounds
def custom_learning_rate(current_iter):
    base_learning_rate = 0.1
    lr = base_learning_rate * (0.95 ** (current_iter // 10))
    return lr
# Create LearningRateScheduler callback
lr_scheduler = xgb.callback.LearningRateScheduler(custom_learning_rate)
# Create the XGBoost classifier with the learning rate schedule
# (passing callbacks to the constructor requires XGBoost >= 1.6)
model = xgb.XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    random_state=42,
    callbacks=[lr_scheduler]
)
# Train the model
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
By tuning hyperparameters, engineering new features, handling class imbalance, using early stopping, ensembling multiple models, applying regularization, and scheduling the learning rate, you can significantly improve your model's performance compared to a baseline model.