
XGBoost for the Diabetes Dataset

The diabetes dataset is a well-known regression dataset: ten baseline physiological measurements for 442 patients, with a target variable giving a quantitative measure of disease progression one year after baseline.

In this example, we’ll load the diabetes dataset from scikit-learn, perform hyperparameter tuning using GridSearchCV with common XGBoost regression parameters, save the best model, load it, and use it to make a prediction on a sample data point.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor
import numpy as np

# Load the diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Features: {diabetes.feature_names}")
print(f"Target: {diabetes.DESCR.splitlines()[1]}")

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create XGBRegressor (n_jobs=1 here so GridSearchCV can parallelize across candidates instead)
model = XGBRegressor(objective='reg:squarederror', random_state=42, n_jobs=1)

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_diabetes.ubj')

# Load saved model
loaded_model = XGBRegressor()
loaded_model.load_model('best_model_diabetes.ubj')

# Create a sample data point (the first row of the diabetes dataset)
sample_data_point = np.array([[0.03807591, 0.05068012, 0.06169621, 0.02187235, -0.0442235,
                               -0.03482076, -0.04340085, -0.00259226, 0.01990842, -0.01764613]])

# Use loaded model for prediction
prediction = loaded_model.predict(sample_data_point)
print(f"Predicted value: {prediction[0]:.3f}")

Running the example, you will see results like the following:

Dataset shape: (442, 10)
Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
Target: disease progression one year after baseline
Best score: 0.438
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.8}
Predicted value: 221.751

In this example, we first load the diabetes dataset using load_diabetes() from scikit-learn. We print some key information about the dataset, such as its shape, feature names, and a description of the target variable.
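If you prefer working with pandas, load_diabetes() can also return the features and target as a DataFrame and Series. A minor variation, assuming scikit-learn 0.23 or later for the as_frame option:

from sklearn.datasets import load_diabetes

# Return the features and target directly, as pandas objects
X, y = load_diabetes(return_X_y=True, as_frame=True)
print(X.head())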

Next, we split the data into train and test sets using train_test_split(). We define a parameter grid with common XGBoost regression hyperparameters.
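Note that this grid defines 3 × 3 × 3 × 2 × 2 = 108 candidate parameter combinations, so 3-fold cross-validation fits 324 models in total. If you want to confirm the candidate count before launching a long search, scikit-learn's ParameterGrid can enumerate it:

from sklearn.model_selection import ParameterGrid

# Count the candidate combinations the grid search will evaluate
print(len(ParameterGrid(param_grid)))  # 108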

We create an instance of XGBRegressor with the objective set to ‘reg:squarederror’ for regression tasks, and perform a grid search using GridSearchCV with 3-fold cross-validation. After fitting the grid search object, we print the best score and corresponding best parameters.
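By default, GridSearchCV scores regressors with R², which is what the best score above reports. If you would rather tune on mean squared error, you can pass a scoring argument. A minimal sketch reusing the model and param_grid from the example (grid_search_mse is just an illustrative name):

# Tune on (negative) mean squared error instead of the default R^2
grid_search_mse = GridSearchCV(estimator=model, param_grid=param_grid,
                               cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search_mse.fit(X_train, y_train)
print(f"Best RMSE: {(-grid_search_mse.best_score_) ** 0.5:.3f}")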

We access the best model using best_estimator_, save it to a file named ‘best_model_diabetes.ubj’ using save_model(), and load it using load_model().
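As a quick sanity check on the round trip, the loaded model should produce the same predictions as the original. One way to confirm, using the test split from above:

# Verify that the loaded model matches the original
assert np.allclose(best_model.predict(X_test), loaded_model.predict(X_test))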

Finally, we create a sample data point and use the loaded model to make a prediction. The predicted value is then printed.
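The walkthrough above never uses the held-out test set, so it is worth adding a final generalization check. A short sketch using scikit-learn's metrics:

from sklearn.metrics import mean_squared_error, r2_score

# Evaluate the tuned model on the held-out test set
y_pred = best_model.predict(X_test)
print(f"Test R^2: {r2_score(y_test, y_pred):.3f}")
print(f"Test RMSE: {mean_squared_error(y_test, y_pred) ** 0.5:.3f}")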

This example demonstrates how to apply XGBoost to the diabetes dataset, perform hyperparameter tuning, save and load the best model, and use it for making predictions on new data points.


