Improving the accuracy of your XGBoost models is essential for achieving better predictions.
Here are 7 powerful techniques you can use:
Hyperparameter Tuning
Fine-tuning hyperparameters can significantly improve model accuracy. Key hyperparameters include:
- n_estimators: Number of boosting rounds. More trees can improve accuracy but may lead to overfitting.
- learning_rate: Controls the contribution of each tree. Lower values can improve performance but require more boosting rounds.
- max_depth: Maximum depth of a tree. Deeper trees can capture more complex patterns but can overfit.
- min_child_weight: Minimum sum of instance weight needed in a child. Higher values can prevent overfitting.
- subsample: Fraction of samples used for training each tree. Lower values can prevent overfitting.
- colsample_bytree: Fraction of features used for each tree. Reducing this can prevent overfitting.
Use grid search or random search for hyperparameter tuning; the example below uses grid search, and a random-search sketch follows it.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}
# Initialize XGBoost model
model = xgb.XGBClassifier()
# Perform grid search (this grid has 729 combinations, so it can take a while; n_jobs=-1 uses all CPU cores)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=3, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)
# Output best parameters
print(grid_search.best_params_)
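If the full grid is too slow, random search samples a fixed number of parameter combinations instead of trying them all. Below is a minimal sketch using scikit-learn's RandomizedSearchCV on the same data; the choice of n_iter=20 is arbitrary and only for illustration.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, RandomizedSearchCV
# Load and split dataset
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
# Candidate values to sample from
param_distributions = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}
# Random search: evaluate only 20 sampled combinations (n_iter is an arbitrary choice)
random_search = RandomizedSearchCV(estimator=xgb.XGBClassifier(), param_distributions=param_distributions,
                                   n_iter=20, scoring='accuracy', cv=3, random_state=42, verbose=1)
random_search.fit(X_train, y_train)
print(random_search.best_params_)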
Feature Engineering
Creating new features or transforming existing ones can enhance model performance. Techniques include:
- Interaction Features: Combining two or more features.
- Polynomial Features: Raising features to a power.
- Encoding Categorical Variables: One-hot encoding or target encoding (see the encoding sketch after the code below).
Use libraries like pandas and sklearn for feature engineering.
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Interaction Features
X_train['new_feature'] = X_train['mean radius'] * X_train['mean texture']
X_test['new_feature'] = X_test['mean radius'] * X_test['mean texture']
# Polynomial Features
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
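The breast cancer dataset contains only numeric features, so the snippet below is a hedged sketch on a small made-up DataFrame (the tumor_site and outcome columns are invented for illustration) showing one-hot encoding with pandas.get_dummies and a simple target (mean) encoding.
import pandas as pd
# Hypothetical data: tumor_site and outcome are invented columns for illustration
df = pd.DataFrame({
    'tumor_site': ['left', 'right', 'left', 'bilateral'],
    'mean_radius': [14.2, 20.1, 12.8, 18.5],
    'outcome': [0, 1, 0, 1]
})
# One-hot encoding: one binary column per category, so the model receives purely numeric input
df_onehot = pd.get_dummies(df, columns=['tumor_site'])
# Simple target (mean) encoding: replace each category with the mean outcome for that category
# (in practice, compute the means on training data only and add smoothing to avoid leakage)
site_means = df.groupby('tumor_site')['outcome'].mean()
df['tumor_site_te'] = df['tumor_site'].map(site_means)
print(df_onehot)
print(df[['tumor_site', 'tumor_site_te']])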
Handling Class Imbalance
For imbalanced datasets, consider techniques such as:
- Resampling: Oversampling the minority class or undersampling the majority class.
- Class Weights: Using the scale_pos_weight parameter in XGBoost.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Option 1: Resampling - oversample the minority class with SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
model_resampled = xgb.XGBClassifier()
model_resampled.fit(X_train_resampled, y_train_resampled)
# Option 2: Class weights - scale_pos_weight is the ratio of negative to positive training samples
# (use either resampling or class weights; applying both would over-correct the imbalance)
ratio = len(y_train[y_train == 0]) / len(y_train[y_train == 1])
model_weighted = xgb.XGBClassifier(scale_pos_weight=ratio)
model_weighted.fit(X_train, y_train)
# Evaluate both models on the held-out test set
print(f"SMOTE accuracy: {(model_resampled.predict(X_test) == y_test).mean()}")
print(f"Class-weight accuracy: {(model_weighted.predict(X_test) == y_test).mean()}")
Early Stopping
Using early stopping can prevent overfitting by halting training when performance on a validation set stops improving.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with early stopping: allow plenty of rounds and stop when the validation score
# has not improved for 10 rounds. In recent XGBoost versions (>= 1.6), early_stopping_rounds
# is set on the estimator; older versions accepted it in fit().
model = xgb.XGBClassifier(n_estimators=500, early_stopping_rounds=10)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=True)
# Evaluate model
preds = model.predict(X_valid)
accuracy = (preds == y_valid).mean()
print(f"Accuracy: {accuracy}")
Ensembling
Combining multiple models can improve performance through techniques like bagging, boosting, or stacking.
- Bagging: Training multiple XGBoost models on different subsets of the data (see the bagging sketch after the stacking example below).
- Stacking: Training a meta-model on the predictions of several base models.
import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define base models
model1 = xgb.XGBClassifier()
model2 = xgb.XGBClassifier(n_estimators=200, learning_rate=0.05)
# Define stacking model
stacking_model = StackingClassifier(estimators=[('xgb1', model1), ('xgb2', model2)], final_estimator=LogisticRegression())
# Train stacking model
stacking_model.fit(X_train, y_train)
# Evaluate model
preds = stacking_model.predict(X_test)
accuracy = (preds == y_test).mean()
print(f"Accuracy: {accuracy}")
Regularization
Applying regularization can help prevent overfitting:
- alpha (reg_alpha): L1 regularization term on weights.
- lambda (reg_lambda): L2 regularization term on weights.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with regularization
model = xgb.XGBClassifier(reg_alpha=0.1, reg_lambda=1.0)
model.fit(X_train, y_train)
# Evaluate model
preds = model.predict(X_test)
accuracy = (preds == y_test).mean()
print(f"Accuracy: {accuracy}")
Learning Rate Scheduling
Adjusting the learning rate during training can lead to better convergence.
This can be achieved using a learning rate schedule.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import accuracy_score
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=42)
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the learning rate schedule: decay the base rate by 5% every 10 boosting rounds
def custom_learning_rate(current_iter):
    base_learning_rate = 0.1
    lr = base_learning_rate * (0.95 ** (current_iter // 10))
    return lr
# Create LearningRateScheduler callback
lr_scheduler = xgb.callback.LearningRateScheduler(custom_learning_rate)
# Create the XGBoost classifier with the learning rate schedule
# (passing callbacks to the constructor requires XGBoost >= 1.6)
model = xgb.XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    random_state=42,
    callbacks=[lr_scheduler]
)
# Train the model
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
By tuning hyperparameters, engineering new features, handling class imbalance, using early stopping, ensembling multiple models, applying regularization, and scheduling the learning rate, you can significantly improve your model's performance compared to a baseline model.