
XGBoost for the Linnerud Dataset

The Linnerud dataset is a classic multivariate regression dataset. It consists of three exercise variables (the features: Chins, Situps, Jumps) and three physiological variables (the targets: Weight, Waist, Pulse), with a total of 20 observations.

In this example, we’ll load the Linnerud dataset from scikit-learn, perform hyperparameter tuning using GridSearchCV with common XGBoost regression parameters, save the best model, load it, and use it to make predictions.

from sklearn.datasets import load_linnerud
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor

# Load the Linnerud dataset
linnerud = load_linnerud()
X, y = linnerud.data, linnerud.target

# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Features: {linnerud.feature_names}")
print(f"Targets: {linnerud.target_names}")

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create XGBRegressor configured for multi-output regression;
# n_jobs=1 avoids thread oversubscription because GridSearchCV
# parallelizes across candidates with n_jobs=-1 below
model = XGBRegressor(objective='reg:squarederror',
                     tree_method="hist",
                     multi_strategy="multi_output_tree",
                     random_state=42,
                     n_jobs=1)

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best score: {-grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model (the .ubj extension selects the binary UBJSON format)
best_model.save_model('best_model_linnerud.ubj')

# Load saved model
loaded_model = XGBRegressor()
loaded_model.load_model('best_model_linnerud.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_test)

# Print mean squared error and R-squared score
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"Mean Squared Error: {mse:.3f}")
print(f"R-squared Score: {r2:.3f}")

Running the example, you will see results like the following:

Dataset shape: (20, 3)
Features: ['Chins', 'Situps', 'Jumps']
Targets: ['Weight', 'Waist', 'Pulse']
Best score: 233.009
Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 200, 'subsample': 0.8}
Mean Squared Error: 390.026
R-squared Score: -1.877

In this example, we first load the Linnerud dataset using load_linnerud() from scikit-learn. We print some key information about the dataset, such as its shape, feature names, and target names.
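If you want to eyeball the raw values, load_linnerud() also accepts an as_frame=True argument (scikit-learn 0.23 and later) that returns the data as pandas DataFrames. A minimal sketch (the variable names here are illustrative):

from sklearn.datasets import load_linnerud

# Load features and targets together as a pandas DataFrame for inspection
linnerud_frame = load_linnerud(as_frame=True)
print(linnerud_frame.frame.head())  # all six columns: 3 exercise + 3 physiological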

Next, we split the data into train and test sets using train_test_split(). With only 20 observations and test_size=0.2, this leaves 16 rows for training and just 4 for testing, so the evaluation metrics will be volatile. We then define a parameter grid with common XGBoost regression hyperparameters; the sketch below shows what the search costs.
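To get a feel for the cost of the search: the grid has 3 × 3 × 3 × 2 × 2 = 108 candidate combinations, and with 3-fold cross-validation GridSearchCV fits 324 models. A quick check using scikit-learn's ParameterGrid, reusing the param_grid defined in the listing above:

from sklearn.model_selection import ParameterGrid

# Count the candidate hyperparameter combinations in the grid
print(len(ParameterGrid(param_grid)))  # 108 candidates x 3 folds = 324 fits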

We create an instance of XGBRegressor configured for multi-output regression: tree_method="hist" together with multi_strategy="multi_output_tree" (available in XGBoost 2.0 and later) fits a single tree per boosting round that predicts all three targets at once. We then perform a grid search using GridSearchCV with 3-fold cross-validation and negative mean squared error as the scoring metric. After fitting the grid search object, we print the best score and the corresponding best parameters.
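For comparison, XGBoost's default strategy, multi_strategy="one_output_per_tree", builds a separate tree per target in each boosting round rather than one tree predicting all targets. A hedged sketch of that alternative configuration, reusing X_train and y_train from the listing above (the variable name baseline is illustrative):

# Default strategy: one tree per target per boosting round
baseline = XGBRegressor(objective='reg:squarederror',
                        tree_method="hist",
                        multi_strategy="one_output_per_tree",  # the default
                        random_state=42)
baseline.fit(X_train, y_train)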

We access the best model using best_estimator_, save it to a file named ‘best_model_linnerud.ubj’, and demonstrate loading the saved model.
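The file format is inferred from the extension, so if you prefer a human-readable artifact to the binary UBJSON used above, saving to .json works the same way (loaded_json is an illustrative name):

# Save and reload the same model in XGBoost's JSON format
best_model.save_model('best_model_linnerud.json')
loaded_json = XGBRegressor()
loaded_json.load_model('best_model_linnerud.json')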

Finally, we use the loaded model to make predictions on the test set and print the mean squared error and R-squared score. Note that the R-squared score is negative, meaning the model does worse than simply predicting each target's mean; with 20 observations and a 4-row test set, that is not surprising.
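Because this is a multi-output problem, the aggregate scores can hide large differences between targets. Passing multioutput='raw_values' to the scikit-learn metrics gives a per-target breakdown; a short sketch reusing y_test, predictions, and linnerud from the listing above:

# Per-target error breakdown for Weight, Waist, and Pulse
mse_per_target = mean_squared_error(y_test, predictions, multioutput='raw_values')
r2_per_target = r2_score(y_test, predictions, multioutput='raw_values')
for name, m, r in zip(linnerud.target_names, mse_per_target, r2_per_target):
    print(f"{name}: MSE={m:.3f}, R2={r:.3f}")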

By following this approach, you can easily apply XGBoost to the Linnerud dataset for regression tasks, perform hyperparameter tuning, save and load the best model, and evaluate its performance using appropriate metrics.


