
XGBoost for the Kaggle Credit Card Fraud Detection Dataset

The Credit Card Fraud Detection dataset involves predicting whether credit card transactions are fraudulent or not.

The dataset is hosted on the Kaggle website and is a popular imbalanced classification task.

Download the Training Dataset

The first step is to download the creditcard.csv data file from the dataset's page on the Kaggle website.

This will require you to create an account and sign in before you can access the dataset.

XGBoost Example

Next, we can model the dataset with XGBoost.

In this example, we’ll load the training dataset, perform hyperparameter tuning using GridSearchCV with common XGBoost parameters, save the best model, load it, and use it to make predictions on a held-out validation set.

import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
from collections import Counter

# Load the training dataset
dataset = pd.read_csv('creditcard.csv')

# Drop the "Time" column
dataset = dataset.drop('Time', axis=1)

# Split into input and output elements
X, y = dataset.values[:,:-1], dataset.values[:,-1]

# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Input features: {dataset.columns[:-1]}")
print(f"Target variable: {dataset.columns[-1]}")
print(f"Class distributions: {Counter(y)}")

# Split into train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Compute the positive class weight
class_weight = (len(y_train) - np.sum(y_train)) / np.sum(y_train)

# Create XGBClassifier
model = XGBClassifier(objective='binary:logistic', scale_pos_weight=class_weight, tree_method='hist', random_state=42, n_jobs=2)

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='roc_auc', cv=3, n_jobs=4, pre_dispatch=4)
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_creditcard.ubj')

# Load saved model
loaded_model = XGBClassifier()
loaded_model.load_model('best_model_creditcard.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_valid)

# Print ROC AUC score, computed here from hard class labels
rocauc = roc_auc_score(y_valid, predictions)
print(f"ROC AUC: {rocauc:.3f}")

The dataset contains 284,807 samples and 29 features.

The target variable is a binary class label for normal transactions (0) and fraud (1). Only 492 of the 284,807 transactions are fraud, about 0.17% of the data.

We first drop the Time column, then report details about the dataset.
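The class counts reported for this dataset (284,315 normal and 492 fraud) make the imbalance concrete. As a minimal sketch, the negative-to-positive ratio used for scale_pos_weight can be computed from such counts. Note that the example code computes this ratio on the training split rather than on the full dataset, so its value will differ slightly; the variable names below are illustrative.

```python
from collections import Counter

# Class counts for the full dataset, as reported in the example output.
counts = Counter({0: 284315, 1: 492})

total = sum(counts.values())
fraud_rate = counts[1] / total

# Ratio of negative to positive examples, a common starting value
# for XGBoost's scale_pos_weight on imbalanced binary problems.
pos_weight = counts[0] / counts[1]

print(f"Total samples: {total}")
print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Negative-to-positive ratio: {pos_weight:.1f}")
```

A ratio this large means each fraud example is weighted several hundred times more heavily than a normal transaction during training.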

Next, we split the data into train and validation sets, define a parameter grid for hyperparameter tuning, create an XGBClassifier with scale_pos_weight set to the ratio of negative to positive training examples, and perform a grid search with 3-fold cross-validation, optimizing for the Area Under the Receiver Operating Characteristic Curve (ROC AUC).
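To get a feel for the cost of this search, the following sketch counts the candidate configurations in the grid and the number of model fits that 3-fold cross-validation implies. It uses only the standard library and the same grid values as the example; the names candidates and total_fits are illustrative.

```python
from itertools import product

# Same grid values as in the example above.
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}
cv_folds = 3

# Every combination of values is one candidate configuration.
candidates = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]

# Each candidate is fit once per fold. The final refit of the best
# configuration on the full training set is not counted here.
total_fits = len(candidates) * cv_folds

print(f"Candidates: {len(candidates)}")
print(f"Cross-validation fits: {total_fits}")
```

Because the count multiplies across parameters, adding even one more value to a parameter noticeably increases the run time on a dataset of this size.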

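We print the best ROC AUC score and the corresponding parameters, access and save the best model, load the saved model, use it to make predictions on the validation set, and print the ROC AUC score. Note that this final score is computed from the hard class labels returned by predict(), whereas the grid search scores candidates on predicted probabilities, so the two numbers are not directly comparable.

Running this code will load the dataset, perform the grid search, and output results similar to: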
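ROC AUC measures how well a model ranks positives above negatives, so it depends on whether it is given continuous scores or thresholded labels. The sketch below illustrates this with a small hand-written pairwise AUC function and made-up toy values; it is not the scikit-learn implementation, and the numbers are unrelated to the fraud dataset.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy ground truth and predicted positive-class probabilities (made up).
y_true = [0, 0, 0, 1, 1]
probabilities = [0.1, 0.2, 0.6, 0.4, 0.9]

# Hard labels obtained by thresholding at 0.5, as predict() would return.
labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_from_probabilities = roc_auc(y_true, probabilities)
auc_from_labels = roc_auc(y_true, labels)

print(f"AUC from probabilities: {auc_from_probabilities:.3f}")
print(f"AUC from hard labels:   {auc_from_labels:.3f}")
```

Thresholding collapses the ranking into two groups, which creates ties and here lowers the score. In the example code, a probability-based score could be obtained by passing loaded_model.predict_proba(X_valid)[:, 1] to roc_auc_score in place of predictions.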

Dataset shape: (284807, 29)
Input features: Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount'],
      dtype='object')
Target variable: Class
Class distributions: Counter({0.0: 284315, 1.0: 492})
Best score: 0.984
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 300, 'subsample': 0.8}
ROC AUC: 0.932

This example demonstrates how to use XGBoost on a Kaggle dataset, perform hyperparameter tuning, save and load the best model, and evaluate its performance.


