XGBoost for the Kaggle House Sales in King County Dataset

The House Sales in King County dataset involves predicting the sale price of houses based on many details about each house.

The dataset is hosted on the Kaggle website and is a popular regression task.

Download the Training Dataset

The first step is to download the kc_house_data.csv data file from the dataset page on Kaggle.

This requires you to create an account and sign in before you can access the dataset.

House Sales in King County, USA

XGBoost Example

Next, we can model the dataset with XGBoost.

In this example, we’ll load the downloaded dataset, hold out a validation set, perform hyperparameter tuning using GridSearchCV with common XGBoost parameters, save the best model, load it, and use it to make predictions on the validation set.

import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

# Load the training dataset
dataset = pd.read_csv('kc_house_data.csv')

# Drop the "id" column
dataset = dataset.drop('id', axis=1)
# Drop the "date" column
dataset = dataset.drop('date', axis=1)

# Print key information about the dataset
print(f"Dataset shape: {dataset.shape}")
print(f"Input Features: {dataset.columns[1:]}")
print(f"Target variable: {dataset.columns[0]}")

# Split into input and output elements
X, y = dataset.values[:,1:], dataset.values[:,0]

# Split into train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create regressor
model = XGBRegressor(objective='reg:squarederror', random_state=42, n_jobs=1)

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='r2', cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best R2 score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_house_sales.ubj')

# Load saved model
loaded_model = XGBRegressor()
loaded_model.load_model('best_model_house_sales.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_valid)

# Print score
r2 = r2_score(y_valid, predictions)
print(f"R2: {r2:.3f}")

The dataset, as printed, contains 21,613 samples and 19 columns: 18 input features plus the target.

The target variable is a numerical sale price value.

We first drop the id and date columns.
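The positional split used in the example (column 0 as the target, the remaining columns as inputs) only works because price becomes the first column once id and date are removed. A minimal sketch of that step is shown below; the helper name drop_unused_columns and the two-row frame are made up for illustration and are not part of the original example.

```python
import pandas as pd


def drop_unused_columns(frame: pd.DataFrame, unused=("id", "date")) -> pd.DataFrame:
    """Return a copy of `frame` without the listed columns (hypothetical helper)."""
    missing = [col for col in unused if col not in frame.columns]
    if missing:
        # Fail loudly rather than silently keeping an unexpected layout
        raise KeyError(f"Columns not found: {missing}")
    return frame.drop(columns=list(unused))


# Tiny made-up frame standing in for kc_house_data.csv
sample = pd.DataFrame({
    "id": [1, 2],
    "date": ["20141013T000000", "20141209T000000"],
    "price": [221900.0, 538000.0],
    "bedrooms": [3, 3],
})

cleaned = drop_unused_columns(sample)

# After the drop, the target is the first column, as the positional split assumes
print(list(cleaned.columns))
```

Selecting the target by name, for example cleaned["price"], would avoid relying on column order at all.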

Next, we split the data into train and validation sets, define a parameter grid for hyperparameter tuning, create an XGBRegressor, and perform a grid search with 3-fold cross-validation, optimizing for the R^2 (R-squared) metric.
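It helps to estimate the cost of the grid search before running it. The sketch below counts the parameter combinations in the grid from the example and multiplies by the number of folds; the variable names n_candidates and n_fits are chosen here for illustration.

```python
from math import prod

# Same grid as in the example above
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}
cv_folds = 3

# Every combination of values is one candidate model
n_candidates = prod(len(values) for values in param_grid.values())

# Each candidate is trained once per cross-validation fold
n_fits = n_candidates * cv_folds

print(f"Candidates: {n_candidates}, cross-validation fits: {n_fits}")
```

With its default refit setting, GridSearchCV also trains the best candidate one more time on the full training set, so adding a value to any list grows the run time multiplicatively.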

We print the best cross-validated R^2 score and the corresponding parameters, access and save the best model, load the saved model, use it to make predictions on the validation set, and print the validation R^2 score.
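For readers unfamiliar with the metric, R^2 compares the model's squared error with that of a baseline that always predicts the mean of the targets. The following is a plain-Python sketch of that definition, not the scikit-learn implementation; the function name r_squared and the toy values are illustrative.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed from two equal-length sequences."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy targets, not taken from the housing data
y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, y_true)                  # exact predictions
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])   # always predict the mean

print(perfect, baseline)
```

A score of 1 means perfect predictions and a score of 0 means the model does no better than predicting the mean, which gives context for the scores reported by the example.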

Running this code will load the dataset, perform the grid search, and output results similar to:

Dataset shape: (21613, 19)
Input Features: Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')
Target variable: price
Best R2 score: 0.897
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300, 'subsample': 0.8}
R2: 0.884

This example demonstrates how to use XGBoost on a Kaggle dataset, perform hyperparameter tuning, save and load the best model, and evaluate its performance.
