XGBoost for the Cleveland Heart Disease Dataset

The Cleveland Heart Disease dataset is a widely used benchmark for binary classification. It contains 303 patient records with 13 clinical features, and the target indicates whether heart disease is present.

In this example, we’ll load the dataset using scikit-learn’s fetch_openml, tune common XGBoost hyperparameters with GridSearchCV, save the best model, load it back, and use the loaded model to predict on the held-out test set and report its accuracy.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier
import numpy as np
from collections import Counter

# Load the Cleveland Heart Disease dataset
X, y = fetch_openml("heart-disease", return_X_y=True, target_column='target', as_frame=True)

# Represent missing values as NaN, which XGBoost treats as missing by default
X = X.fillna(value=np.nan)

# Convert target to integers
y = y.astype('int')

# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Features: {X.columns.tolist()}")
print(f"Class distributions: {Counter(y)}")

# Convert the feature DataFrame to a NumPy array
X = X.values

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create XGBClassifier (n_jobs=1 so that GridSearchCV, not XGBoost, handles the parallelism)
model = XGBClassifier(objective='binary:logistic', random_state=42, n_jobs=1)

# Perform grid search with 3-fold cross-validation (default scoring for a classifier is accuracy)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_heart_disease.ubj')

# Load saved model
loaded_model = XGBClassifier()
loaded_model.load_model('best_model_heart_disease.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_test)

# Print accuracy score
accuracy = loaded_model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.3f}")

Running this example will yield output similar to:

Dataset shape: (303, 13)
Features: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
Class distributions: Counter({1: 165, 0: 138})
Best score: 0.843
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 50, 'subsample': 0.8}
Accuracy: 0.820
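
To get a feel for how much work the grid search above does, the number of candidate settings and model fits can be counted directly from the parameter grid. The following is a minimal sketch using only the standard library; the variable names (cv_folds, n_candidates, and so on) are illustrative, and it assumes GridSearchCV's default refit=True, which trains one extra model on the full training set.

```python
from itertools import product

# Same grid as in the example above
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}
cv_folds = 3  # matches cv=3 in the GridSearchCV call

# Every combination of one value per parameter is one candidate
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)

# Each candidate is fitted once per cross-validation fold
n_cv_fits = n_candidates * cv_folds

# Assumption: refit=True (the default) adds one final fit of the best candidate
n_total_fits = n_cv_fits + 1

print(f"Candidates: {n_candidates}")
print(f"Cross-validation fits: {n_cv_fits}")
print(f"Total fits including refit: {n_total_fits}")
```

With this grid there are 108 candidates and 324 cross-validation fits, so adding more values per parameter grows the cost multiplicatively. On a dataset this small that is cheap, but it is worth checking before widening the grid.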

This example demonstrates loading the Cleveland Heart Disease dataset, tuning XGBoost hyperparameters with a grid search, saving and reloading the best model, and using the reloaded model to predict on the held-out test set.

See Also