
XGBoost for the Wholesale Customers Dataset

The Wholesale Customers dataset describes clients for a wholesale distributor.

The dataset can be modeled as a binary classification task to predict the customer channel (hotel/restaurant/café vs. retail) from the region and annual spending details.

This dataset is available from the UCI Machine Learning Repository and can be downloaded automatically using the fetch_ucirepo function from the ucimlrepo package, which can be installed with your preferred Python package manager, such as pip:

pip install ucimlrepo
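
Once installed, a quick sanity check is to fetch the dataset and print the shape of the raw table (a minimal sketch; id=292 is the UCI identifier for this dataset, also used in the full example below):

from ucimlrepo import fetch_ucirepo

# Fetch the Wholesale Customers dataset by its UCI repository id
dataset = fetch_ucirepo(id=292)

# data.original holds the full table as a pandas DataFrame
print(dataset.data.original.shape)  # (440, 8): Channel, Region, and six spending columns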

In this example, we’ll load the dataset, perform hyperparameter tuning using GridSearchCV with common XGBoost parameters, save the best model, load it, and use it to make predictions.

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from collections import Counter
from ucimlrepo import fetch_ucirepo

# Fetch the dataset
dataset = fetch_ucirepo(id=292)

# Split into input and output features
X = dataset.data.original.values[:, 1:]
y = dataset.data.original.values[:, 0]

# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Features: {dataset.data.headers}")
print(f"Class distributions: {Counter(y)}")

# Encode target variable (Channel values 1 and 2 become 0 and 1)
y = LabelEncoder().fit_transform(y)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Compute scale_pos_weight as ratio of negative to positive instances in train set
weight = len(y_train[y_train == 0]) / len(y_train[y_train == 1])

# Create XGBClassifier
model = XGBClassifier(objective='binary:logistic', scale_pos_weight=weight, random_state=42, n_jobs=1)

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_wholesale.ubj')

# Load saved model
loaded_model = XGBClassifier()
loaded_model.load_model('best_model_wholesale.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_test)

# Print accuracy score
accuracy = loaded_model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.3f}")

Running this example will output results similar to:

Dataset shape: (440, 7)
Columns: Index(['Channel', 'Region', 'Fresh', 'Milk', 'Grocery', 'Frozen',
       'Detergents_Paper', 'Delicassen'],
      dtype='object')
Class distribution: Counter({1: 298, 2: 142})
Best score: 0.920
Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.8}
Accuracy: 0.932
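
Accuracy alone can hide per-class behavior on an imbalanced dataset like this one (298 vs. 142 samples per channel). As an optional extension, you can print per-class precision, recall, and F1 with scikit-learn's classification_report, reusing predictions and y_test from the example above (per the UCI documentation, Channel 1 is Horeca and Channel 2 is Retail):

from sklearn.metrics import classification_report

# Label 0 is the encoded Channel 1 (Horeca), label 1 is Channel 2 (Retail)
print(classification_report(y_test, predictions, target_names=['Horeca', 'Retail']))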

This example demonstrates the complete workflow: loading the dataset, tuning XGBoost hyperparameters with GridSearchCV, saving the best model, and loading it again to make predictions.

By following this approach, you can systematically find good hyperparameters for an XGBoost model and reuse the saved model for prediction tasks.
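
If you need channel probabilities rather than hard labels, the loaded model also exposes predict_proba. The sketch below scores a single hypothetical customer; the Region code and spending figures are made up for illustration and follow the same column order as X (Region, Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen):

import numpy as np

# Hypothetical customer: Region code plus six annual spending figures (illustrative values only)
new_customer = np.array([[3, 12000, 9000, 15000, 1200, 6000, 1500]])

# Column 0 is the probability of encoded class 0 (Channel 1), column 1 of class 1 (Channel 2)
print(loaded_model.predict_proba(new_customer))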



See Also