
XGBoost for the Kaggle House Prices Dataset

The House Prices prediction dataset involves predicting the sale price of houses based on many details about each house.

It is a getting started competition designed to teach predictive modeling.

The dataset is hosted on the Kaggle website and is a popular task for regression.

Download the Training Dataset

The first step is to download the train.csv data file from the competition website.

This will require you to create an account and sign in before you can access the dataset.
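If you have the Kaggle command line tool installed and an API token configured, the same file can also be fetched from the command line; the competition slug below is taken from the competition URL, so adjust it if it differs:

kaggle competitions download -c house-prices-advanced-regression-techniques
unzip house-prices-advanced-regression-techniques.zip

Either way, make sure train.csv ends up in the same directory as your script.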

XGBoost Example

Next, we can model the dataset with XGBoost.

In this example, we’ll load the training dataset, ordinal encode the categorical features, perform hyperparameter tuning using GridSearchCV with common XGBoost parameters, save the best model, load it, and use it to make predictions on a hold-out validation set.

import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import mean_squared_log_error
from xgboost import XGBRegressor

# Load the training dataset
dataset = pd.read_csv('train.csv')

# Drop the "id" column
dataset = dataset.drop('Id', axis=1)

# Print key information about the dataset
print(f"Dataset shape: {dataset.shape}")
print(f"Features: {dataset.columns[:-1]}")
print(f"Target variable: {dataset.columns[-1]}")

# Split into input and output elements
X, y = dataset.values[:,:-1], dataset.values[:,-1]

# Enumerate columns and ordinal encode categorical input features
for idx, col in enumerate(dataset.columns[:-1]):
    if dataset[col].dtype == 'object':
        X[:,idx] = OrdinalEncoder().fit_transform(X[:,idx].reshape(-1, 1)).ravel()

# Convert the mixed-type array to float (missing values remain NaN, which XGBoost handles)
X = X.astype(float)

# Split into train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [100, 500, 1000],
    'subsample': [0.8, 1.0],
    'reg_alpha': [0, 0.1, 0.001, 1],
    'colsample_bytree': [0.8, 1.0]
}

# Create regressor
model = XGBRegressor(objective='reg:squarederror', random_state=42, n_jobs=2)

# Perform grid search
grid_search = GridSearchCV(estimator=model,
                           param_grid=param_grid,
                           scoring='neg_mean_squared_log_error',
                           cv=3,
                           n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print best score and parameters (best_score_ is the negative MSLE, so abs + sqrt gives RMSLE)
score = np.sqrt(abs(grid_search.best_score_))
print(f"Best RMSLE score: {score:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_house_prices.ubj')

# Load saved model
loaded_model = XGBRegressor()
loaded_model.load_model('best_model_house_prices.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_valid)

# Print score
msle = mean_squared_log_error(y_valid, predictions)
print(f"RMSLE: {np.sqrt(msle):.3f}")

The dataset contains 1,460 samples and, after dropping the Id column, 79 input features plus the target. The target variable, SalePrice, is a numeric sale price.
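Many of those columns are string-valued and several contain missing values, which motivates the encoding step below. If you want to confirm this yourself, a quick check along these lines (reusing the dataset variable from the example) shows the column dtypes and the most incomplete columns:

# Count columns by dtype and list the columns with the most missing values
print(dataset.dtypes.value_counts())
print(dataset.isnull().sum().sort_values(ascending=False).head(10))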

We first drop the Id column and then ordinal encode all categorical (object-typed) input variables.
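Ordinal encoding is the simplest way to get the string columns into a numeric array. As an aside, recent XGBoost releases (roughly 1.6 onward, with tree_method='hist') can also handle categorical features natively if you pass a pandas DataFrame whose string columns are converted to the category dtype; a minimal sketch, not used in the example above:

# Sketch: native categorical handling instead of manual ordinal encoding
df = pd.read_csv('train.csv').drop('Id', axis=1)
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category')
X_df, y_df = df.drop('SalePrice', axis=1), df['SalePrice']
native_model = XGBRegressor(tree_method='hist', enable_categorical=True, random_state=42)
native_model.fit(X_df, y_df)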

Next, we split the data into train and validation sets, define a parameter grid for hyperparameter tuning, create an XGBRegressor, and perform a grid search with 3-fold cross-validation, optimizing for the Mean Squared Log Error (MSLE) metric.
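Note that the grid above contains 432 parameter combinations, so with 3-fold cross-validation the search fits the model 1,296 times and can take a while. If that is too slow, a randomized search over the same grid is a common shortcut; a sketch, reusing the model, param_grid and training split from the example:

from sklearn.model_selection import RandomizedSearchCV

# Sample 30 random combinations from the grid instead of trying all of them
random_search = RandomizedSearchCV(estimator=model,
                                   param_distributions=param_grid,
                                   n_iter=30,
                                   scoring='neg_mean_squared_log_error',
                                   cv=3,
                                   random_state=42,
                                   n_jobs=-1)
random_search.fit(X_train, y_train)
print(random_search.best_params_)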

We print the best cross-validation RMSLE score and the corresponding best parameters, access and save the best model, load the saved model, use it to make predictions on the validation set, and print the validation RMSLE.

Running this code will load the dataset, perform the grid search, and output results similar to:

Dataset shape: (1460, 80)
Features: Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal',
       'MoSold', 'YrSold', 'SaleType', 'SaleCondition'],
      dtype='object')
Target variable: SalePrice
Best RMSLE score: 0.132
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 1000, 'reg_alpha': 1, 'subsample': 0.8}
RMSLE: 0.139
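The validation RMSLE gives a rough idea of how the model would score on the competition leaderboard. To actually enter the competition you need predictions for Kaggle's test.csv; the sketch below assumes the dataset and loaded_model variables from the example are still in scope and that a recent scikit-learn is installed (categories unseen in training are mapped via handle_unknown='use_encoded_value'):

# Encode test.csv with encoders fitted on the training data, then write a submission file
test = pd.read_csv('test.csv')
test_ids = test['Id']
test = test.drop('Id', axis=1)
X_test = test.values

for idx, col in enumerate(dataset.columns[:-1]):
    if dataset[col].dtype == 'object':
        encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
        encoder.fit(dataset[[col]])
        X_test[:,idx] = encoder.transform(test[[col]]).ravel()

# Predict sale prices and save in the Id,SalePrice format Kaggle expects
test_predictions = loaded_model.predict(X_test.astype(float))
pd.DataFrame({'Id': test_ids, 'SalePrice': test_predictions}).to_csv('submission.csv', index=False)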

This example demonstrates how to use XGBoost on a Kaggle dataset, perform hyperparameter tuning, save and load the best model, and evaluate its performance.



See Also