XGBoost for the Kaggle Bank Churn Dataset

The Bank Customer Churn Prediction Dataset involves predicting whether a bank customer will leave the bank (churn).

The dataset is hosted on Kaggle and is a popular task for imbalanced classification.

Download the Training Dataset

The first step is to download the Churn_Modelling.csv data file from the dataset page on Kaggle.

This will require you to create an account and sign in before you can access the dataset.

Bank Customer Churn Prediction

XGBoost Example

Next, we can address the dataset with XGBoost.

In this example, we’ll load the training dataset, perform hyperparameter tuning using GridSearchCV with common XGBoost parameters, save the best model, load it, and use it to make predictions on a held-out validation set.

import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import f1_score
from xgboost import XGBClassifier
from collections import Counter

# Load the training dataset
dataset = pd.read_csv('Churn_Modelling.csv')

# Print key information about the dataset
print(f"Dataset shape: {dataset.shape}")
print(f"Features: {dataset.columns[:-1]}")
print(f"Target variable: {dataset.columns[-1]}")
print(f"Class distributions: {Counter(dataset.values[:,-1])}")

# Split into input and output elements
X, y = dataset.values[:,:-1], dataset.values[:,-1]

# Drop the "RowNumber" and "CustomerId" identifier columns
X = X[:, 2:]

# Encode categorical variables: Surname (0), Geography (2), Gender (3)
nom = [0, 2, 3]
transformer = ColumnTransformer(transformers=[('ord', OrdinalEncoder(), nom)], remainder='passthrough')
# Perform ordinal encoding
X = transformer.fit_transform(X)

# Ensure class labels are integers
y = y.astype('int')

# Split into train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Compute the positive class weight (ratio of negative to positive samples)
class_weight = (len(y_train) - np.sum(y_train)) / np.sum(y_train)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create XGBClassifier
model = XGBClassifier(objective='binary:logistic', scale_pos_weight=class_weight, random_state=42, n_jobs=1)

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='f1', cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best F1 score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_churn.ubj')

# Load saved model
loaded_model = XGBClassifier()
loaded_model.load_model('best_model_churn.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_valid)

# Print score
f1 = f1_score(y_valid, predictions)
print(f"F1: {f1:.3f}")

The dataset contains 10,002 samples and 14 columns: 13 input columns and the binary target variable, Exited. After dropping RowNumber and CustomerId, 11 input features remain.

We split the data into train and validation sets, define a parameter grid for hyperparameter tuning, create an XGBClassifier, and perform a grid search with 3-fold cross-validation that optimizes for the F1 metric.
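To get a feel for the cost of the grid search, the sketch below counts how many model fits the parameter grid implies. It uses only the standard library and copies the grid from the example; the variable names n_candidates, cv_folds, and n_fits are illustrative and not part of the original code.

```python
from math import prod

# Parameter grid copied from the example above
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

# Number of distinct parameter combinations in the grid
n_candidates = prod(len(values) for values in param_grid.values())

# Each combination is fit once per cross-validation fold (cv=3 in the example)
cv_folds = 3
n_fits = n_candidates * cv_folds

print(f"Candidates: {n_candidates}")
print(f"Cross-validation fits: {n_fits}")
```

This gives 108 candidates and 324 cross-validation fits. With GridSearchCV's default refit=True, one more fit of the best configuration on the full training set follows, which is why grid size is worth checking before adding more values.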

We print the best cross-validated F1 score and the corresponding best parameters, access and save the best model, load the saved model, use it to make predictions on the validation set, and print the validation F1 score.

Running this code will load the dataset, perform the grid search, and output results similar to:
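For reference, the F1 score reported here is the harmonic mean of precision and recall on the positive (churn) class. The following is a minimal pure-Python sketch of that calculation on made-up labels; the helper f1_binary and the toy data are hypothetical and only mirror what f1_score computes for the binary case.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1): 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    # No positives in either the labels or the predictions: report 0.0
    return 2 * tp / denominator if denominator else 0.0

# Toy labels (not taken from the dataset): 3 true positives,
# 1 false positive, and 1 false negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"F1: {f1_binary(y_true, y_pred):.3f}")  # F1: 0.750
```

Because F1 ignores true negatives, it is less flattering than accuracy on an imbalanced dataset, which makes it a reasonable scoring choice for this task.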

Dataset shape: (10002, 14)
Features: Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary'],
      dtype='object')
Target variable: Exited
Class distributions: Counter({0: 7964, 1: 2038})
Best F1 score: 0.607
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 200, 'subsample': 0.8}
F1: 0.610
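The class counts in this output show the imbalance that scale_pos_weight is meant to offset. As a rough illustration, the sketch below reproduces the negative-to-positive ratio from the full-dataset counts printed above; the example itself computes the weight on the training split only, so its value may differ slightly.

```python
# Class counts taken from the "Class distributions" line of the output above
neg, pos = 7964, 2038

# Share of samples in the positive (churn) class
pos_rate = pos / (neg + pos)

# Ratio of negatives to positives, the same formula the example
# uses for scale_pos_weight (there applied to y_train only)
ratio = neg / pos

print(f"Positive class rate: {pos_rate:.3f}")
print(f"Negative-to-positive ratio: {ratio:.3f}")
```

Roughly one customer in five churns, so each positive example is weighted about 3.9 times as heavily as a negative one.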

This example demonstrates how to use XGBoost on a Kaggle dataset, perform hyperparameter tuning, save and load the best model, and evaluate its performance.
