
XGBoost for the Kaggle Otto Group Product Classification Dataset

The Otto Group Product Classification Challenge was a Kaggle competition to predict the category of products based on 93 anonymous features.

Download the Training Dataset

The first step is to download the train.csv training dataset from the competition website.

This will require you to create an account and sign in before you can access the dataset.

We must also accept the competition rules.

Finally, we can download the train.csv.zip file, which contains the train.csv dataset.
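
If you prefer to script this step, the archive can be unpacked with Python's standard zipfile module once it has been downloaded. This is a minimal sketch that assumes train.csv.zip is in the current working directory:

# Minimal sketch: extract train.csv from the downloaded archive
# Assumes train.csv.zip has already been downloaded to the working directory
import zipfile

with zipfile.ZipFile('train.csv.zip') as archive:
    archive.extract('train.csv')  # writes train.csv to the working directory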

XGBoost Example

Next, we can model the dataset with XGBoost.

This example demonstrates how to load the training dataset, explore the data, perform hyperparameter tuning with XGBoost, save the best model, load it, and use it to make predictions.

import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from collections import Counter

# Load the training dataset
otto = pd.read_csv('train.csv')

# Split into input and output elements and drop "id"
X, y = otto.iloc[:,1:-1], otto.iloc[:,-1]

# Encode the target variable
y = LabelEncoder().fit_transform(y)

# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Number of features: {X.shape[1]}")
print(f"Target variable: {otto.columns[-1]}")
print(f"Class distributions: {Counter(y)}")

# Split into train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create XGBClassifier
model = XGBClassifier(objective='multi:softprob', random_state=42, n_jobs=1)

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_otto.ubj')

# Load saved model
loaded_model = XGBClassifier()
loaded_model.load_model('best_model_otto.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_valid)

# Print accuracy score
accuracy = accuracy_score(y_valid, predictions)
print(f"Accuracy: {accuracy:.3f}")

First, we load the dataset using pandas. The dataset contains 61,878 samples and 93 anonymous features. The target variable is the product category, which has 9 possible classes. We encode the target variable with a LabelEncoder, then split the data into train and validation sets.
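
If you later want to map the integer-encoded predictions back to the original class labels (for example, when building a Kaggle submission), keep a reference to the fitted LabelEncoder rather than calling fit_transform anonymously. A minimal sketch, assuming the same otto DataFrame as above:

# Keep the fitted encoder so predictions can be mapped back to class names
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(otto.iloc[:, -1])

print(label_encoder.classes_)                      # original label names
print(label_encoder.inverse_transform([0, 1, 8]))  # integers back to labels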

Next, we define a parameter grid with common XGBoost hyperparameters, create an XGBClassifier with the multi:softprob objective for multiclass classification, and perform a grid search with 3-fold cross-validation to find the best parameters.
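
The grid above contains 3 × 3 × 3 × 2 × 2 = 108 candidate settings, so 3-fold cross-validation fits 324 models in total. Note also that the competition itself was scored on multiclass log loss rather than accuracy; a variant sketch that tunes for that metric instead is shown below, reusing the param_grid, X_train, and y_train defined in the example:

# Variant sketch: optimize the grid search for multiclass log loss
# (the competition metric) instead of the default accuracy scoring
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

model = XGBClassifier(objective='multi:softprob', random_state=42, n_jobs=1)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           scoring='neg_log_loss', cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# best_score_ is a negated log loss, so values closer to zero are better
print(f"Best (negative) log loss: {grid_search.best_score_:.3f}")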

After printing the best score and parameters, we access the best model, save it to disk, load the saved model, use it to make predictions on the validation set, and print the multiclass classification accuracy.
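
Because Kaggle evaluated this competition on multiclass log loss, it is also worth checking the loaded model's predicted probabilities, not just its accuracy. A minimal sketch, reusing loaded_model, X_valid, and y_valid from the example above:

# Evaluate the loaded model with the competition metric (multiclass log loss)
from sklearn.metrics import log_loss

probabilities = loaded_model.predict_proba(X_valid)  # shape: (n_samples, 9)
validation_log_loss = log_loss(y_valid, probabilities)
print(f"Validation log loss: {validation_log_loss:.3f}")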

Running this code will output results similar to:

Dataset shape: (61878, 93)
Number of features: 93
Target variable: target
Class distributions: Counter({1: 16122, 5: 14135, 7: 8464, 2: 8004, 8: 4955, 6: 2839, 4: 2739, 3: 2691, 0: 1929})
Best score: 0.809
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300, 'subsample': 0.8}
Accuracy: 0.816

This example demonstrates the end-to-end process of using XGBoost on a Kaggle dataset, from downloading the data to making predictions with a tuned model.



See Also