XGBoost for Multiple-Output Regression Manually

When faced with a multiple output regression problem (multi-out regression), where the goal is to predict several continuous target variables simultaneously, one approach is to train a separate XGBoost model for each target variable.

While XGBoost does have modest native support for multiple output regression, this manual approach offers more flexibility than a wrapper such as MultiOutputRegressor from scikit-learn, at the cost of writing more code.

This example demonstrates how to manually train multiple XGBoost models, one for each target variable, to solve a multiple output regression task.

We’ll generate a synthetic dataset, prepare the data, initialize and train the models, make predictions, and evaluate the overall performance.

# XGBoosting.com
# Manually train separate XGBoost models for each target in multiple output regression
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate a synthetic multi-output regression dataset
X, y = make_regression(n_samples=1000,
                       n_features=10,
                       n_targets=3,
                       noise=0.1,
                       random_state=42,
                       n_informative=5)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize a list to store the trained models
models = []

# Loop through each target variable
for i in range(y_train.shape[1]):
    # Select the current target variable
    y_train_i = y_train[:, i]

    # Initialize an XGBRegressor for the current target
    model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

    # Fit the XGBRegressor on the training data for the current target
    model.fit(X_train, y_train_i)

    # Append the trained model to the list of models
    models.append(model)

# Make predictions by predicting each target separately using the corresponding model
y_pred = np.column_stack([model.predict(X_test) for model in models])

# Evaluate the overall performance using mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")

Here’s a step-by-step breakdown:

1. Generate a synthetic multi-output regression dataset with 10 input features and 3 output targets.
2. Split the data into training and testing sets using train_test_split.
3. Initialize an empty list called models to store the trained XGBoost models.
4. Loop through each target variable:
   - Select the current target variable from the training data.
   - Initialize an XGBRegressor with chosen hyperparameters for the current target.
   - Fit the XGBRegressor on the training data for the current target using fit().
   - Append the trained model to the models list.
5. Make predictions on the test set by predicting each target separately using the corresponding model and combining the results into a single array using np.column_stack().
6. Evaluate the overall performance using Mean Squared Error (MSE).
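
When mean_squared_error receives two-dimensional arrays, scikit-learn by default returns the unweighted average of the per-target errors, so one poorly predicted target can hide behind a single number. The sketch below shows how to inspect each target separately with the multioutput="raw_values" option. The arrays are made-up toy values used only to keep the example small; they are not results from the models above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values for illustration only: 4 samples, 2 targets
y_true = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0],
                   [4.0, 40.0]])
y_hat = np.array([[1.0, 12.0],
                  [2.0, 18.0],
                  [3.0, 32.0],
                  [4.0, 38.0]])

# One MSE per target column instead of a single averaged number
per_target_mse = mean_squared_error(y_true, y_hat, multioutput="raw_values")

# The default (multioutput="uniform_average") is the unweighted mean
# of the per-target values
overall_mse = mean_squared_error(y_true, y_hat)

for i, value in enumerate(per_target_mse):
    print(f"Target {i} MSE: {value:.4f}")
print(f"Overall MSE: {overall_mse:.4f}")
```

In this toy case the first target is predicted exactly and the second is off by 2 on every sample, so the per-target values differ even though the overall score is a single figure.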

By manually training separate XGBoost models for each target variable, you have full control over the training process: each target can get its own hyperparameters, feature subset, or stopping criteria. This can potentially improve performance over a generic wrapper, which applies one configuration to every target. However, this approach requires more code and may not be as convenient as using a pre-built solution like MultiOutputRegressor.

This example provides a foundation for training XGBoost models for multiple output regression tasks manually. Depending on your specific dataset and requirements, you may need to preprocess the data, tune hyperparameters, or use different evaluation metrics to optimize performance.
