XGBoost for the California Housing Dataset

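The California Housing dataset is a classic regression benchmark derived from the 1990 U.S. census. It contains 20,640 census block groups described by 8 numeric features, and the target is the median house value in units of $100,000.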

In this example, we’ll load the dataset from scikit-learn, perform hyperparameter tuning using GridSearchCV with common XGBoost regression parameters, save the best model, load it, and use it to make predictions.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from xgboost import XGBRegressor

# Load the California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Features: {housing.feature_names}")

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create XGBRegressor
model = XGBRegressor(objective='reg:squarederror', random_state=42, n_jobs=1)

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best score: {-grid_search.best_score_:.3f} (MSE)")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_housing.ubj')

# Load saved model
loaded_model = XGBRegressor()
loaded_model.load_model('best_model_housing.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_test)

# Print evaluation metrics
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"Mean Absolute Error: {mae:.3f}")
print(f"Mean Squared Error: {mse:.3f}")
print(f"R^2 Score: {r2:.3f}")

Running this example, you might see results like:

Dataset shape: (20640, 8)
Features: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
Best score: 0.214 (MSE)
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300, 'subsample': 1.0}
Mean Absolute Error: 0.298
Mean Squared Error: 0.205
R^2 Score: 0.844

In this example, we load the California Housing dataset using fetch_california_housing() from scikit-learn and print the dataset shape and feature names.

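We split the data into train and test sets, define a parameter grid with common XGBoost regression hyperparameters, create an XGBRegressor instance, and perform a grid search with 3-fold cross-validation. The grid contains 3 × 3 × 3 × 2 × 2 = 108 combinations, so the search fits 324 models plus one final refit of the best combination on the full training set. We use 'neg_mean_squared_error' as the scoring metric. GridSearchCV always maximizes its score, so scikit-learn negates the MSE to make higher values better; the combination with the highest negative MSE is the one with the lowest MSE. We flip the sign of best_score_ ourselves when printing so that the reported value is a positive MSE.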
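The sign convention is easier to see on a small problem. The sketch below is illustrative only: it uses a synthetic dataset from make_regression and a Ridge model as a fast stand-in for XGBRegressor (neither appears in the example above), and it checks that best_score_ is simply the largest of the negative-MSE means across candidates.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem (assumption: not the housing data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# Ridge is a stand-in estimator chosen only to keep the demo fast
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 1.0, 100.0]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# One mean score per candidate; each is a negated MSE, so none is positive
neg_mse_per_candidate = search.cv_results_["mean_test_score"]
best_neg_mse = search.best_score_   # the largest (closest to zero) of them
best_mse = -best_neg_mse            # flip the sign for reporting

print("Mean score per candidate:", neg_mse_per_candidate)
print("best_score_ (negative MSE):", best_neg_mse)
print("Reported MSE:", best_mse)
```

The same reasoning applies to any scikit-learn scorer whose name starts with neg_: the search maximizes the negated error, and the sign is flipped only for display.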

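After fitting the grid search object, we print the best cross-validated score (MSE) and the corresponding best parameters. This score is averaged over the validation folds of the training set, so it will usually differ slightly from the test-set MSE reported later. We access the best model, save it to a file, load the saved model, and use it to make predictions on the test set.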

Finally, we evaluate the model’s performance using mean absolute error (MAE), mean squared error (MSE), and R^2 score.
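To make the three metrics concrete, here is a small sketch that computes them by hand using their standard definitions. The helper name regression_metrics and the four toy values are made up for illustration; in practice the scikit-learn functions used above are the better choice.

```python
def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R^2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error

    # R^2 = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, r2


# Toy values (hypothetical, not taken from the housing data)
y_true_demo = [3.0, 1.0, 2.0, 4.0]
y_pred_demo = [2.5, 1.0, 2.0, 5.0]

mae_demo, mse_demo, r2_demo = regression_metrics(y_true_demo, y_pred_demo)
print(mae_demo, mse_demo, r2_demo)  # 0.375 0.3125 0.75
```

MAE and MSE are in the units of the target (and its square), while R^2 is unitless and compares the model against always predicting the mean of the target.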

This example demonstrates how to perform hyperparameter tuning with XGBoost for a regression task, save and load the best model, and evaluate its performance on a classic dataset like the California Housing dataset.
