XGBoost for the Horse Colic Dataset

The Horse Colic dataset is a well-known dataset describing horses with colic, a common digestive disorder. It contains clinical measurements and other medical attributes for each horse. The original data records whether each horse survived, died, or was euthanized, but the OpenML version loaded here uses surgical_lesion (whether the lesion was surgical) as its target, so the task in this example is binary classification.

In this example, we’ll load the Horse Colic dataset using fetch_openml from scikit-learn, perform hyperparameter tuning using GridSearchCV with common XGBoost parameters, save the best model, load it, and use it to make predictions.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier
import numpy as np
from collections import Counter

# Load the Horse Colic dataset
colic = fetch_openml('colic', as_frame=True)
X, y = colic.data, colic.target

# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Features: {colic.feature_names}")
print(f"Target variable: {colic.target_names}")
print(f"Class distributions: {Counter(y)}")

# Retrieve raw values
X = X.values

# Ensure class numbers start at 0
y = y.values.astype('int') - 1


# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create XGBClassifier (the target has two classes, so use a binary objective)
model = XGBClassifier(objective='binary:logistic', n_jobs=1, random_state=42)

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_horse_colic.ubj')

# Load saved model
loaded_model = XGBClassifier()
loaded_model.load_model('best_model_horse_colic.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_test)

# Print accuracy score
accuracy = loaded_model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.3f}")

Running the example, you will see results similar to the following:

Dataset shape: (368, 26)
Features: ['surgery', 'Age', 'rectal_temperature', 'pulse', 'respiratory_rate', 'temperature_of_extremities', 'peripheral_pulse', 'mucous_membranes', 'capillary_refill_time', 'pain', 'peristalsis', 'abdominal_distension', 'nasogastric_tube', 'nasogastric_reflux', 'nasogastric_reflux_PH', 'rectal_examination_-_feces', 'abdomen', 'packed_cell_volume', 'total_protein', 'abdominocentesis_appearance', 'abdomcentesis_total_protein', 'outcome', 'site_of_lesion', 'type_of_lesion', 'subtype_of_lesion', 'pathology_cp_data']
Target variable: ['surgical_lesion']
Class distributions: Counter({'1': 232, '2': 136})
Best score: 0.878
Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 50, 'subsample': 1.0}
Accuracy: 0.878

In this example, we load the Horse Colic dataset using fetch_openml from scikit-learn. We print key information about the dataset, including its shape, feature names, target variable, and class distribution.

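We prepare the data by converting the feature DataFrame to a NumPy array and shifting the class labels from 1 and 2 to 0 and 1, since XGBoost expects class labels that start at 0. The listing does not impute or encode anything itself; XGBoost treats NaN values as missing and handles them natively. We then hold out 20% of the rows as a test set, stratified by class.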
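Explicit preprocessing with median imputation, an OrdinalEncoder, and a LabelEncoder could look like the sketch below. It is not part of the listing above: it runs on a small hand-made DataFrame, and the column selection and values are assumptions chosen for illustration, not rows from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy frame standing in for the real data (values are made up for illustration)
df = pd.DataFrame({
    "pulse": [66.0, 88.0, np.nan, 104.0],
    "rectal_temperature": [38.5, np.nan, 38.3, 39.1],
    "pain": ["mild", "severe", "mild", "mild"],
})
target = pd.Series(["1", "2", "1", "2"])

numeric_cols = ["pulse", "rectal_temperature"]
categorical_cols = ["pain"]

# Median-impute numeric columns, map category strings to integer codes
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", OrdinalEncoder(), categorical_cols),
])

X_prepared = preprocess.fit_transform(df)

# LabelEncoder maps the sorted class labels to 0..n_classes-1
y_encoded = LabelEncoder().fit_transform(target)

print(X_prepared)
print(y_encoded)
```

In a real workflow the transformer would be fit on the training split only and then applied to the test split, so that the medians are not computed from held-out rows.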

Next, we define a parameter grid for hyperparameter tuning and create an XGBClassifier. We run GridSearchCV with 3-fold cross-validation on the training set, then print the best score (the mean cross-validated accuracy of the winning candidate) and the parameters that produced it.
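To get a feel for the cost of the search, the number of candidates and model fits can be counted before running it. The sketch below uses scikit-learn's ParameterGrid on the same grid as the listing; the variable names n_folds, n_candidates, and n_fits are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the listing above
param_grid = {
    "max_depth": [3, 4, 5],
    "learning_rate": [0.1, 0.01, 0.05],
    "n_estimators": [50, 100, 200],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}

n_folds = 3  # matches cv=3 in the listing

# Every combination of values is one candidate: 3 * 3 * 3 * 2 * 2
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fit once per fold, plus one final refit of the best
# candidate on the full training set (GridSearchCV's default refit=True)
n_fits = n_candidates * n_folds + 1

print(f"Candidates: {n_candidates}")
print(f"Total fits: {n_fits}")
```

On a dataset of a few hundred rows this is cheap, but the count grows multiplicatively with every value added to the grid.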

We access the best model using best_estimator_, save it in XGBoost's binary UBJSON format to a file named ‘best_model_horse_colic.ubj’, and load it back into a fresh XGBClassifier using load_model().

Finally, we use the loaded model to make predictions on the test set and print the accuracy score.

By following this approach, you can tune XGBoost hyperparameters on the Horse Colic dataset, save the best model, and reload it later to predict whether a horse's colic involves a surgical lesion.
