XGBoost for the Boston Housing Dataset

The Boston Housing dataset is a well-known regression benchmark. It contains 506 samples with 13 features describing housing and neighborhood characteristics in the Boston area, and the goal is to predict the median value of owner-occupied homes (MEDV).

In this example, we'll load the Boston Housing dataset using fetch_openml from scikit-learn, ordinal-encode its categorical columns, perform hyperparameter tuning using GridSearchCV with common XGBoost parameters, save the best model, load it, and use it to make predictions.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from xgboost import XGBRegressor
import numpy as np

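# Load the Boston Housing dataset from OpenML
# (version=1 pins the original regression dataset and avoids a multiple-versions warning)
boston = fetch_openml('boston', version=1, as_frame=True)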
X, y = boston.data, boston.target

# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Features: {boston.feature_names}")
print(f"Target variable: {boston.target_names}")

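# Ordinal-encode the categorical columns CHAS and RAD.
# Note: ColumnTransformer puts the encoded columns first, followed by the
# passthrough columns, and returns a NumPy array rather than a DataFrame.
nominal = ['CHAS', 'RAD']
transformer = ColumnTransformer(transformers=[('ordinal', OrdinalEncoder(), nominal)], remainder='passthrough')
X = transformer.fit_transform(X)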

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create XGBRegressor
model = XGBRegressor(objective='reg:squarederror', random_state=42, n_jobs=1)

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best score: {-grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_boston.ubj')

# Load saved model
loaded_model = XGBRegressor()
loaded_model.load_model('best_model_boston.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_test)

# Compute mean squared error by hand; score() returns R-squared for regressors
mse = np.mean((y_test - predictions) ** 2)
r2 = loaded_model.score(X_test, y_test)
print(f"Mean Squared Error: {mse:.3f}")
print(f"R-squared: {r2:.3f}")

Running the example, you will see results similar to the following:

Dataset shape: (506, 13)
Features: ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
Target variable: ['MEDV']
Best score: 12.314
Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}
Mean Squared Error: 6.077
R-squared: 0.917

In this example, we load the Boston Housing dataset using fetch_openml from scikit-learn. We print key information about the dataset, including its shape, feature names, and target variable.

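Next, we ordinal-encode the categorical columns CHAS and RAD, split the data into train and test sets, define a parameter grid for hyperparameter tuning, and create an instance of XGBRegressor. We perform a grid search using GridSearchCV with 3-fold cross-validation, fit the grid search object, and print the best score and corresponding best parameters. Because scoring is set to neg_mean_squared_error, we negate best_score_ before printing, so the reported best score is the mean cross-validated MSE.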
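To get a feel for what this search costs, the number of model fits can be counted directly from the grid. The sketch below copies param_grid and the cv=3 setting from the example; the names cv_folds, n_candidates, and n_fits are illustrative and do not appear in the original code.

```python
import math

# Same grid as in the example above
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}
cv_folds = 3  # matches cv=3 in the GridSearchCV call

# Every combination of parameter values is one candidate model
n_candidates = math.prod(len(values) for values in param_grid.values())

# Each candidate is trained once per cross-validation fold
n_fits = n_candidates * cv_folds

print(f"Candidates: {n_candidates}")        # 108
print(f"Cross-validation fits: {n_fits}")   # 324
```

With the default refit=True, GridSearchCV then trains one additional model on the full training set using the best parameters, and that model is what best_estimator_ returns.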

We access the best model using best_estimator_, save it to a file named 'best_model_boston.ubj' with save_model() (the .ubj extension selects XGBoost's binary UBJSON format), and demonstrate loading the saved model using load_model().

Finally, we use the loaded model to make predictions on the test set and print the mean squared error (MSE) and R-squared score to evaluate the model’s performance.
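For reference, both metrics can be computed by hand. The sketch below uses a handful of made-up values (not taken from the Boston test set) to show the MSE formula used in the example and the R-squared definition that a scikit-learn regressor's score() method returns; RMSE is added here as an extra because it is expressed in the same units as the target.

```python
import numpy as np

# Illustrative values only (not results from the example above)
y_true = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_pred = np.array([25.0, 20.6, 33.7, 35.4, 36.2])

residuals = y_true - y_pred

# Mean squared error: average of the squared residuals
mse = np.mean(residuals ** 2)

# Root mean squared error: same units as the target
rmse = np.sqrt(mse)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}, R-squared: {r2:.3f}")
```

Since MEDV is recorded in thousands of dollars, the test MSE of 6.077 reported above corresponds to an RMSE of roughly 2.47, that is, a typical error of about $2,470.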

This workflow covers hyperparameter tuning with GridSearchCV on the Boston Housing dataset, saving the best XGBoost model, and reloading it for predictions. The same steps apply to other regression tasks.

See Also