The Titanic dataset, hosted on the Kaggle website, is widely used for learning machine learning.
It involves developing a model that predicts which passengers survived the Titanic shipwreck.
Download the Training Dataset
The first step is to download the train.csv training dataset from the competition website.
This requires you to create an account and sign in before you can access the dataset.
You may also have to accept the competition rules.
Finally, we can download the train.csv file from the data page.
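Before moving on, it can help to confirm that the downloaded file has the columns the example relies on. The following is a minimal sketch using only the standard library; the helper name `missing_columns` is hypothetical, and the expected column list is assumed from the columns this tutorial drops, keeps, and predicts.

```python
import csv
import io

# Columns of train.csv as used in this tutorial (assumed from the example code and output)
EXPECTED_COLUMNS = [
    "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
    "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked",
]

def missing_columns(csv_file, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [name for name in expected if name not in header]

# Demo on an in-memory header line; for the real file, pass
# open('train.csv', newline='') instead of io.StringIO.
sample = io.StringIO("PassengerId,Survived,Pclass,Name,Sex,Age\n")
print(missing_columns(sample))
```

An empty list means every expected column is present; anything else suggests the wrong file was downloaded.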
XGBoost Example
Next, we can model the dataset with XGBoost.
In this example, we’ll load the training dataset from train.csv, perform hyperparameter tuning using GridSearchCV with common XGBoost parameters, save the best model, load it, and use it to make predictions on a held-out validation set.
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from collections import Counter
# Load the training dataset
dataset = pd.read_csv('train.csv')
# Drop the "PassengerId", "Name", "Ticket", and "Cabin" columns
dataset = dataset.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])
# Split into input and output elements
X, y = dataset.values[:,1:], dataset.values[:,0]
# Encode class label as integer values
y = y.astype('int')
# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Input Features: {dataset.columns[1:]}")
print(f"Target variable: {dataset.columns[0]}")
print(f"Class distributions: {Counter(y)}")
# Ordinal encode categorical input features ("Sex" at index 1, "Embarked" at index 6)
nom = [1, 6]
transformer = ColumnTransformer(transformers=[('ord', OrdinalEncoder(), nom)], remainder='passthrough')
# Perform ordinal encoding
X = transformer.fit_transform(X)
# Split into train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=42, stratify=y)
# Compute the positive class weight
class_weight = (len(y_train) - np.sum(y_train)) / np.sum(y_train)
# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 1.0],
    'reg_alpha': [0, 0.1, 1],
    'reg_lambda': [0, 0.1, 1],
    'colsample_bytree': [0.8, 1.0],
    'scale_pos_weight': [1, class_weight]
}
# Create XGBClassifier
model = XGBClassifier(objective='binary:logistic', random_state=42, n_jobs=1)
# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)
# Print best score and parameters
print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")
# Access best model
best_model = grid_search.best_estimator_
# Save best model
best_model.save_model('best_model_titanic.ubj')
# Load saved model
loaded_model = XGBClassifier()
loaded_model.load_model('best_model_titanic.ubj')
# Use loaded model for predictions
predictions = loaded_model.predict(X_valid)
# Print accuracy score
accuracy = accuracy_score(y_valid, predictions)
print(f"Accuracy: {accuracy:.3f}")
The dataset contains 891 samples; after four columns are dropped, 7 input features remain.
The target variable is a binary classification label indicating whether the passenger survived (1) or did not survive (0).
First, we drop a number of columns that contain non-predictive or complex values that might require feature engineering.
Next, we encode categorical input variables.
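To make the encoding step concrete, here is a plain-Python sketch of the idea behind ordinal encoding: each category is mapped to its position in the sorted list of unique categories. This is an illustrative stand-in, not scikit-learn's implementation; the function name `ordinal_encode` is hypothetical, and missing values are ignored here for simplicity.

```python
def ordinal_encode(values):
    """Map each category to its index in the sorted list of unique categories."""
    categories = sorted(set(values))
    mapping = {category: float(index) for index, category in enumerate(categories)}
    return [mapping[value] for value in values], categories

# Example with values like those in the "Sex" column
sex_codes, sex_categories = ordinal_encode(["male", "female", "female", "male"])
print(sex_categories)  # ['female', 'male']
print(sex_codes)       # [1.0, 0.0, 0.0, 1.0]
```

The integer codes carry an implied order that the categories do not really have, which tree-based models such as XGBoost generally tolerate because they split on thresholds rather than fit a linear trend.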
We split the data into train and validation sets, define a parameter grid for hyperparameter tuning, create an XGBClassifier, and perform a grid search with 3-fold cross-validation.
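It is worth estimating how much work the grid search implies before running it. The sketch below counts the candidate combinations in the grid from the example and multiplies by the number of folds. The value 1.6 is only a placeholder for the computed class weight, since only the number of options per parameter matters for the count.

```python
from math import prod

# Same grid as in the example; 1.6 is a placeholder for the computed class weight
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 1.0],
    'reg_alpha': [0, 0.1, 1],
    'reg_lambda': [0, 0.1, 1],
    'colsample_bytree': [0.8, 1.0],
    'scale_pos_weight': [1, 1.6],
}
cv = 3  # number of cross-validation folds

# An exhaustive grid search evaluates every combination on every fold
n_candidates = prod(len(options) for options in param_grid.values())
n_fits = n_candidates * cv

print(f"Candidates: {n_candidates}")  # Candidates: 1944
print(f"Model fits: {n_fits}")        # Model fits: 5832
```

With thousands of fits, the search can take a while even on a dataset this small; trimming the grid or switching to a randomized search are common ways to reduce the cost.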
We print the best score and corresponding best parameters, access and save the best model, load the saved model, use it to make predictions on the validation set, and print the classification accuracy.
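One entry in the parameter grid, scale_pos_weight, is computed as the ratio of negative to positive examples. The following sketch reproduces that ratio from the class distribution of the full dataset (549 negatives, 342 positives); the example code computes it on the training split only, so its value may differ slightly. The function name `positive_class_weight` is hypothetical.

```python
from collections import Counter

def positive_class_weight(labels):
    """Ratio of negative (0) to positive (1) examples."""
    counts = Counter(labels)
    return counts[0] / counts[1]

# Labels matching the class distribution of the full dataset
labels = [0] * 549 + [1] * 342
weight = positive_class_weight(labels)
print(f"scale_pos_weight: {weight:.3f}")  # scale_pos_weight: 1.605
```

A value above 1 tells XGBoost to weight errors on the positive class more heavily, which can help when the positive class is the minority.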
Running this code will load the dataset from train.csv, perform the grid search, and output results similar to:
Dataset shape: (891, 7)
Input Features: Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'], dtype='object')
Target variable: Survived
Class distributions: Counter({0: 549, 1: 342})
Best score: 0.827
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 200, 'reg_alpha': 1, 'reg_lambda': 0.1, 'scale_pos_weight': 1, 'subsample': 0.8}
Accuracy: 0.818
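To put the accuracy in context, it can be compared against a naive baseline that always predicts the majority class. The sketch below computes that baseline from the class distribution of the full dataset; because the split is stratified, the validation set should have roughly the same proportions. The function name `majority_baseline_accuracy` is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy achieved by always predicting the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Labels matching the class distribution of the full dataset
labels = [0] * 549 + [1] * 342
baseline = majority_baseline_accuracy(labels)
print(f"Baseline accuracy: {baseline:.3f}")  # Baseline accuracy: 0.616
```

A tuned model scoring well above this baseline suggests it has learned something beyond the class imbalance.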
This example demonstrates how to use XGBoost on a small, real-world Kaggle dataset, perform hyperparameter tuning, save and load the best model, and evaluate its performance.