
XGBoost for the Wine Dataset

The Wine dataset is a classic dataset for classification tasks, available in scikit-learn. It contains 178 samples with 13 features describing chemical properties of wines, and a target variable indicating one of three cultivars (classes).
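
Before modelling, it can help to glance at the raw feature and class names. The short snippet below is a minimal sketch of that kind of inspection using scikit-learn's load_wine(); it is not part of the main example.

from sklearn.datasets import load_wine

# Inspect the 13 chemical features and the 3 cultivar labels
wine = load_wine()
print(wine.feature_names)
print(wine.target_names)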

In this example, we’ll load the Wine dataset, split it into train and test sets, perform hyperparameter tuning using GridSearchCV with common XGBoost parameters, save the best model, load it, and use it to make predictions.

from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier
import numpy as np
from collections import Counter

# Load the Wine dataset (class labels are already 0, 1, 2)
wine = load_wine()
X, y = wine.data, wine.target

# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Classes: {np.unique(y)}")
print(f"Class Distributions: {Counter(y)}")

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create XGBClassifier
model = XGBClassifier(objective='multi:softmax', random_state=42, n_jobs=1)

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_wine.ubj')

# Load saved model
loaded_model = XGBClassifier()
loaded_model.load_model('best_model_wine.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_test)

# Print accuracy score
accuracy = loaded_model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.3f}")

Running the example produces output similar to:

Dataset shape: (178, 13)
Classes: [0 1 2]
Class Distributions: Counter({1: 71, 0: 59, 2: 48})
Best score: 0.979
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50, 'subsample': 0.8}
Accuracy: 1.000

The example first loads the Wine dataset and prints its shape, the unique class labels, and the class distribution. The data is then split into stratified train and test sets.
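
Because stratify=y is passed to train_test_split, both splits keep roughly the original class proportions. Assuming the variables from the example above are still in scope, a quick check might look like this:

# Class counts in each split should mirror the full dataset's 71/59/48 ratio
print(f"Train distribution: {Counter(y_train)}")
print(f"Test distribution: {Counter(y_test)}")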

Next, a parameter grid is defined with common XGBoost hyperparameters. An XGBClassifier is instantiated and a grid search with 3-fold cross-validation is performed to find the best hyperparameters.
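
To get a feel for the cost of the search, the grid above expands to 3 × 3 × 3 × 2 × 2 = 108 candidate combinations, or 324 model fits with 3-fold cross-validation. A small sketch to confirm the count, reusing the param_grid from the example:

from sklearn.model_selection import ParameterGrid

# Number of hyperparameter combinations the grid search will evaluate
print(len(ParameterGrid(param_grid)))  # 108; each is fit cv=3 times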

The best model is accessed, saved, then loaded to demonstrate model persistence. Finally, the loaded model is used to make predictions on the test set, and the accuracy score is printed.
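
As a sanity check on persistence, the reloaded model should produce exactly the same predictions as the in-memory best model. Assuming best_model, loaded_model, and X_test from the example are still available, something like this could be used:

import numpy as np

# The saved-and-reloaded model should agree with the original on every test sample
assert np.array_equal(best_model.predict(X_test), loaded_model.predict(X_test))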

This example showcases how to use XGBoost for multiclass classification on the Wine dataset, including hyperparameter tuning, model saving and loading, and evaluation.
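
Accuracy alone can hide per-class differences, especially since the three cultivars are not equally represented. As an optional extension (not part of the original example), per-class precision, recall, and F1 could be reported with scikit-learn's classification_report:

from sklearn.metrics import classification_report

# Per-class metrics for the three cultivars on the held-out test set
print(classification_report(y_test, loaded_model.predict(X_test)))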



See Also