XGBoost for the Kaggle Higgs Boson Dataset

The Higgs Boson Machine Learning Challenge was a Kaggle competition to classify events as signal or background based on high-energy physics data from the CERN Large Hadron Collider.

It is an important dataset for XGBoost as it was the competition for which the XGBoost library was announced.

Download the Training Dataset

The first step is to download the training.csv file from the competition website.

You will need to create a Kaggle account and sign in before you can access the dataset.

You must also accept the competition rules.

Finally, download the training.zip file from the competition's data page. This archive contains the training.csv dataset.
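
Once the archive is downloaded, training.csv has to be extracted before the example below can read it. The following is a minimal sketch of that step using only the Python standard library. The helper name extract_training_csv is hypothetical, and the sketch assumes that training.csv sits at the top level of training.zip.

```python
import zipfile
from pathlib import Path


def extract_training_csv(zip_path, out_dir="."):
    """Extract training.csv from the downloaded archive and return its path.

    Hypothetical helper: assumes training.csv is stored at the top level
    of the archive.
    """
    out_dir = Path(out_dir)
    with zipfile.ZipFile(zip_path) as archive:
        # Fail early with a clear message if this is not the expected archive
        if "training.csv" not in archive.namelist():
            raise FileNotFoundError(f"training.csv not found in {zip_path}")
        archive.extract("training.csv", path=out_dir)
    return out_dir / "training.csv"
```

Calling extract_training_csv("training.zip") from the download directory should leave training.csv next to the archive, which is where the example expects to find it.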

Higgs Boson Machine Learning Challenge

XGBoost Example

Next, we can model the dataset with XGBoost.

In this example, we’ll load the training dataset, perform hyperparameter tuning using GridSearchCV with common XGBoost parameters, save the best model, load it, and use it to make predictions on a held-out validation set.

import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from collections import Counter

# Load the training dataset
higgs = pd.read_csv('training.csv')

# Split into input and output elements
X, y = higgs.values[:,:-1], higgs.values[:,-1]

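# Drop the "EventId" and "Weight" columns from the input and cast to float
# (higgs.values is an object array because the Label column holds strings)
X = X[:, 1:-1].astype(float)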

# Encode the target variable
y = LabelEncoder().fit_transform(y)

# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Features: {higgs.columns[1:-2].tolist()}")
print(f"Target variable: {higgs.columns[-1]}")
print(f"Class distributions: {Counter(y)}")

# Split into train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Compute the positive class weight
class_weight = (len(y_train) - np.sum(y_train)) / np.sum(y_train)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create XGBClassifier
model = XGBClassifier(objective='binary:logistic', scale_pos_weight=class_weight, random_state=42, n_jobs=1)

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_higgs.ubj')

# Load saved model
loaded_model = XGBClassifier()
loaded_model.load_model('best_model_higgs.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_valid)

# Print accuracy score
accuracy = accuracy_score(y_valid, predictions)
print(f"Accuracy: {accuracy:.3f}")

The dataset contains 250,000 samples and, once the EventId and Weight columns are dropped, 30 features. The target variable, Label, is binary. We split the data into stratified train and validation sets, compute the ratio of negative to positive training examples for scale_pos_weight, define a parameter grid for hyperparameter tuning, create an XGBClassifier, and perform a grid search with 3-fold cross-validation.
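
The grid search is the expensive part of this example, so it helps to know how many models it trains. The sketch below counts the parameter combinations in the grid above and multiplies by the number of folds. It assumes GridSearchCV's default refit=True, which adds one final fit of the best candidate on the whole training set.

```python
from math import prod

# Same grid as in the example above
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}
cv_folds = 3

# One candidate per combination of parameter values
n_candidates = prod(len(values) for values in param_grid.values())

# Each candidate is fitted once per fold, plus one refit of the best candidate
n_fits = n_candidates * cv_folds + 1

print(f"Candidates: {n_candidates}")  # Candidates: 108
print(f"Model fits: {n_fits}")        # Model fits: 325
```

With 325 fits on 200,000 training rows, expect the search to take a while. Trimming the grid is the simplest way to shorten it.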

We print the best score and corresponding best parameters, access and save the best model, load the saved model, use it to make predictions on the validation set, and print the classification accuracy.

Running this code will load the dataset, perform the grid search, and output results similar to:

Dataset shape: (250000, 30)
Features: ['DER_mass_MMC', 'DER_mass_transverse_met_lep', 'DER_mass_vis', 'DER_pt_h', 'DER_deltaeta_jet_jet', 'DER_mass_jet_jet', 'DER_prodeta_jet_jet', 'DER_deltar_tau_lep', 'DER_pt_tot', 'DER_sum_pt', 'DER_pt_ratio_lep_tau', 'DER_met_phi_centrality', 'DER_lep_eta_centrality', 'PRI_tau_pt', 'PRI_tau_eta', 'PRI_tau_phi', 'PRI_lep_pt', 'PRI_lep_eta', 'PRI_lep_phi', 'PRI_met', 'PRI_met_phi', 'PRI_met_sumet', 'PRI_jet_num', 'PRI_jet_leading_pt', 'PRI_jet_leading_eta', 'PRI_jet_leading_phi', 'PRI_jet_subleading_pt', 'PRI_jet_subleading_eta', 'PRI_jet_subleading_phi', 'PRI_jet_all_pt']
Target variable: Label
Class distributions: Counter({0: 164333, 1: 85667})
Best score: 0.828
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200, 'subsample': 0.8}
Accuracy: 0.827
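
To put the accuracy in context, the class counts printed above can be turned into two reference numbers: the accuracy of always predicting the majority class, and the negative-to-positive ratio that scale_pos_weight is based on. This is a small worked calculation using the full-dataset counts. The example computes the ratio on the training split only, which should be nearly identical because the split is stratified.

```python
# Counts copied from the "Class distributions" line of the output above
n_negative, n_positive = 164333, 85667
n_total = n_negative + n_positive

# Accuracy obtained by always predicting class 0
majority_baseline = n_negative / n_total

# Ratio of negative to positive examples, the quantity used for scale_pos_weight
neg_pos_ratio = n_negative / n_positive

print(f"Majority-class accuracy: {majority_baseline:.3f}")  # 0.657
print(f"Negative/positive ratio: {neg_pos_ratio:.3f}")      # 1.918
```

The reported validation accuracy of 0.827 is therefore well above the 0.657 obtained by ignoring the features altogether.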

This example demonstrates how to use XGBoost on a large-scale Kaggle dataset, perform hyperparameter tuning, save and load the best model, and evaluate its performance.
