XGBoost for the Wisconsin Breast Cancer Dataset

The Wisconsin Breast Cancer dataset is a binary classification dataset available in scikit-learn. It contains 569 samples, each with 30 numeric features computed from digitized images of breast masses, and a binary target indicating whether the mass is malignant (0) or benign (1).

In this example, we’ll load the Breast Cancer dataset, perform hyperparameter tuning using GridSearchCV with common XGBoost parameters, save the best model, load it, and use it to make predictions.

from collections import Counter

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Number of features: {len(data.feature_names)}")
print(f"Target names: {data.target_names}")
print(f"Target distributions: {Counter(y)}")

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Compute scale_pos_weight as the ratio of negative (0, malignant) to positive (1, benign)
# instances in the train set; benign is the majority class here, so the value is below 1
scale_pos_weight = len(y_train[y_train == 0]) / len(y_train[y_train == 1])

# Create XGBClassifier
model = XGBClassifier(objective='binary:logistic',
                      scale_pos_weight=scale_pos_weight,
                      random_state=42,
                      n_jobs=1)

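# Perform grid search: 108 parameter combinations x 3 folds = 324 fits, plus a final refit
# With no scoring argument, GridSearchCV uses the classifier's default metric (accuracy)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print the best mean cross-validated score and the parameters that produced it
print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")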

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_breast_cancer.ubj')

# Load saved model
loaded_model = XGBClassifier()
loaded_model.load_model('best_model_breast_cancer.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_test)

# Print accuracy score
accuracy = loaded_model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.3f}")

Running this example will output results similar to:

Dataset shape: (569, 30)
Number of features: 30
Target names: ['malignant' 'benign']
Target distributions: Counter({1: 357, 0: 212})
Best score: 0.963
Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}
Accuracy: 0.956

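This example covers the full workflow: loading the Breast Cancer dataset, tuning XGBoost hyperparameters with GridSearchCV, saving the best model, loading it back, and making predictions. The reported best score is the mean cross-validated accuracy on the training data, while the final accuracy is computed on the held-out test set.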

The same approach applies to other binary classification tasks. Keep in mind that a grid search only finds the best combination within the grid you supply, not a global optimum.

See Also