
XGBoost for the Abalone Age Dataset

The Abalone dataset is a well-known benchmark for regression tasks, where the goal is to predict the age of an abalone from physical measurements.

It contains information about abalones, such as their sex, length, diameter, height, and various weights, with the target variable being the number of rings, which is a proxy for age.
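The original UCI documentation notes that adding 1.5 to the ring count gives an approximate age in years, so ring predictions translate directly into age estimates. A minimal sketch of that conversion (the example ring counts are made up):

# Sketch: convert ring counts to approximate age in years (age ≈ rings + 1.5)
import pandas as pd

rings = pd.Series([15, 7, 9], dtype=float)  # example ring counts
age_years = rings + 1.5
print(age_years.tolist())  # [16.5, 8.5, 10.5]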

In this example, we’ll load the Abalone dataset using fetch_openml from scikit-learn, encode the categorical Sex feature, perform hyperparameter tuning using GridSearchCV with common XGBoost parameters, save the best model, load it, and use it to make predictions.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from xgboost import XGBRegressor
import numpy as np

# Load the Abalone dataset
abalone = fetch_openml('abalone', as_frame=True)
X, y = abalone.data, abalone.target

# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Features: {abalone.feature_names}")
print(f"Target variable: {abalone.target_names}")

# Encode categorical variables
nominal = ['Sex']
transformer = ColumnTransformer(transformers=[('ordinal', OrdinalEncoder(), nominal)], remainder='passthrough')
# Perform ordinal encoding
X = transformer.fit_transform(X)

# Convert target to numeric
y = y.astype(float)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create XGBRegressor
model = XGBRegressor(objective='reg:squarederror', random_state=42, n_jobs=-1)

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best score: {-grid_search.best_score_:.3f} MSE")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_abalone.ubj')

# Load saved model
loaded_model = XGBRegressor()
loaded_model.load_model('best_model_abalone.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_test)

# Print mean squared error
mse = np.mean((predictions - y_test) ** 2)
print(f"Mean Squared Error: {mse:.3f}")

Running the example, you will see results similar to the following:

Dataset shape: (4177, 8)
Features: ['Sex', 'Length', 'Diameter', 'Height', 'Whole_weight', 'Shucked_weight', 'Viscera_weight', 'Shell_weight']
Target variable: ['Class_number_of_rings']
Best score: 4.607 MSE
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 200, 'subsample': 1.0}
Mean Squared Error: 4.953

In this example, we load the Abalone dataset using fetch_openml from scikit-learn. We print key information about the dataset, including its shape, feature names, and target variable.
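If you want to eyeball the raw rows before any encoding, the Bunch returned by fetch_openml with as_frame=True also exposes the combined DataFrame through its frame attribute; a quick optional sketch:

# Optional: inspect the first few rows of the raw dataset (features plus target)
print(abalone.frame.head())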

We ordinally encode the categorical Sex feature with a ColumnTransformer, convert the target variable to numeric type, and split the data into train and test sets.
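If you would rather skip the manual encoding step, newer XGBoost releases (roughly 1.6 and later) can consume pandas categorical columns directly when enable_categorical is set and a histogram-based tree method is used. A minimal sketch of that alternative, reusing the unencoded DataFrame and the numeric target from above:

# Alternative (assumes XGBoost >= 1.6): let XGBoost handle 'Sex' natively
X_raw = abalone.data.copy()
X_raw['Sex'] = X_raw['Sex'].astype('category')
native_model = XGBRegressor(objective='reg:squarederror', tree_method='hist',
                            enable_categorical=True, random_state=42)
native_model.fit(X_raw, y)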

Next, we define a parameter grid for hyperparameter tuning and create an instance of XGBRegressor. We perform a grid search using GridSearchCV with 3-fold cross-validation, fit the grid search object, and print the best score (negative mean squared error) and corresponding best parameters.
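The grid above contains 3 × 3 × 3 × 2 × 2 = 108 combinations, each evaluated with 3-fold cross-validation; when the grid grows, a randomized search over the same parameter space is a common way to cap the cost. A minimal sketch, assuming the same model and param_grid as above:

# Alternative: sample a fixed number of combinations instead of the full grid
from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(estimator=model, param_distributions=param_grid,
                                   n_iter=20, cv=3, n_jobs=-1,
                                   scoring='neg_mean_squared_error', random_state=42)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")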

We access the best model using best_estimator_, save it to a file named ‘best_model_abalone.ubj’, and demonstrate loading the saved model using load_model().
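save_model() infers the format from the file extension: '.ubj' produces the compact binary UBJSON format used here, while '.json' produces a human-readable JSON file. A minimal sketch of the JSON variant:

# Alternative: save and reload the best model as human-readable JSON
best_model.save_model('best_model_abalone.json')
json_model = XGBRegressor()
json_model.load_model('best_model_abalone.json')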

Finally, we use the loaded model to make predictions on the test set and print the mean squared error.
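If you want metrics beyond raw MSE, scikit-learn's metrics module provides the usual regression scores; a short optional sketch computing RMSE and MAE on the same predictions:

# Optional: additional error metrics on the test set
from sklearn.metrics import mean_squared_error, mean_absolute_error

rmse = mean_squared_error(y_test, predictions) ** 0.5
mae = mean_absolute_error(y_test, predictions)
print(f"RMSE: {rmse:.3f}, MAE: {mae:.3f}")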

By following this approach, you can easily perform hyperparameter tuning on the Abalone dataset using XGBoost, save the best model, and use it for making predictions.



See Also