The Cleveland Heart Disease dataset is a popular dataset for binary classification, containing features related to diagnosing heart disease.
In this example, we’ll load the dataset using scikit-learn’s fetch_openml, perform hyperparameter tuning using GridSearchCV with common XGBoost parameters, save the best model, load it, and use it to make predictions on the held-out test set.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier
import numpy as np
from collections import Counter
# Load the Cleveland Heart Disease dataset
X, y = fetch_openml("heart-disease", return_X_y=True, target_column='target', as_frame=True)
# Mark missing as nan
X = X.fillna(value=np.nan)
# Convert target to integers
y = y.astype('int')
# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Features: {X.columns.tolist()}")
print(f"Class distributions: {Counter(y)}")
# Retrieve values
X = X.values
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}
# Create XGBClassifier
model = XGBClassifier(objective='binary:logistic', random_state=42, n_jobs=1)
# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)
# Print best score and parameters
print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")
# Access best model
best_model = grid_search.best_estimator_
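# Save best model (the filename is an assumed placeholder)
best_model.save_model('best_model.json')
# Load saved model
loaded_model = XGBClassifier()
loaded_model.load_model('best_model.json')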
# Use loaded model for predictions
predictions = loaded_model.predict(X_test)
# Print accuracy score
accuracy = loaded_model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.3f}")
Running this example will yield output similar to:
Dataset shape: (303, 13)
Features: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
Class distributions: Counter({1: 165, 0: 138})
Best score: 0.843
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 50, 'subsample': 0.8}
Accuracy: 0.820
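The cost of an exhaustive grid search grows with the product of the list lengths in the parameter grid. As a rough sketch, the snippet below counts the candidate combinations for the grid used in this example using only the standard library. The variable names `cv_folds`, `candidates`, `n_candidates`, and `n_fits` are illustrative and not part of any library API.

```python
from itertools import product

# Same grid as in the example above
param_grid = {
    "max_depth": [3, 4, 5],
    "learning_rate": [0.1, 0.01, 0.05],
    "n_estimators": [50, 100, 200],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}
cv_folds = 3  # matches cv=3 in the example

# Enumerate every combination of parameter values
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_candidates = len(candidates)

# Each candidate is trained once per cross-validation fold
n_fits = n_candidates * cv_folds

print(f"Candidates: {n_candidates}")
print(f"Cross-validation fits: {n_fits}")
```

With the default refit=True, GridSearchCV also trains the best candidate once more on the full training set, so the total number of model fits is one higher than the cross-validation count.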
This example demonstrates loading the Cleveland Heart Disease dataset, performing hyperparameter tuning with XGBoost, saving and loading the best model, and using it to make predictions on the held-out test set.