Training multiple XGBoost models on different datasets or with different hyperparameters can be time-consuming when done sequentially.
However, by leveraging Python’s ThreadPoolExecutor, you can train multiple models in parallel, potentially reducing the overall training time significantly.

To achieve good parallel speedup, it’s crucial to restrict BLAS/OpenMP to a single thread (for example via the OMP_NUM_THREADS environment variable) and to set n_jobs=1 (or a small number) for each model, so the models do not contend for the same CPU cores.

This example demonstrates how to train multiple XGBoost models in parallel using ThreadPoolExecutor and compares the execution time against sequential training.
import os
# Limit OpenMP/BLAS to a single thread to avoid oversubscription;
# set before importing numpy/xgboost so it takes effect
os.environ['OMP_NUM_THREADS'] = '1'
import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from concurrent.futures import ThreadPoolExecutor
import time

# List of hyperparameter configurations
def get_params(n_jobs):
    return [
        {'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.1, 'n_jobs': n_jobs},
        {'n_estimators': 200, 'max_depth': 4, 'learning_rate': 0.05, 'n_jobs': n_jobs},
        {'n_estimators': 150, 'max_depth': 5, 'learning_rate': 0.08, 'n_jobs': n_jobs},
        {'n_estimators': 180, 'max_depth': 3, 'learning_rate': 0.12, 'n_jobs': n_jobs},
    ]

# Train a single XGBoost model and return it
def train_model(params):
    global X, y
    model = XGBClassifier(**params)
    model.fit(X, y)
    return model

# Sequential model training
def train_sequential(param_sets):
    for params in param_sets:
        train_model(params)

# Parallel model training using a thread pool
def train_parallel(param_sets):
    with ThreadPoolExecutor(max_workers=4) as p:
        _ = [p.submit(train_model, ps) for ps in param_sets]

# Generate synthetic classification dataset
X, y = make_classification(n_samples=1000000, n_features=20, random_state=42)

# Time the sequential training
start_sequential = time.perf_counter()
train_sequential(get_params(4))
end_sequential = time.perf_counter()
print(f"Sequential training time: {end_sequential - start_sequential:.2f} seconds")

# Time the parallel training
start_parallel = time.perf_counter()
train_parallel(get_params(2))
end_parallel = time.perf_counter()
print(f"Parallel training time: {end_parallel - start_parallel:.2f} seconds")

# Calculate speedup
speedup = (end_sequential - start_sequential) / (end_parallel - start_parallel)
print(f"Parallel training is {speedup:.2f} times faster than sequential training")
You may see output that looks like the following:
Sequential training time: 17.81 seconds
Parallel training time: 13.56 seconds
Parallel training is 1.31 times faster than sequential training
The specific speedup factor will depend on the system where the code is run.
Here’s what’s happening:
- We configure BLAS to be single-threaded via the OMP_NUM_THREADS environment variable.
- We generate a synthetic dataset using sklearn.datasets.make_classification.
- We define a function train_model that takes hyperparameters as input and returns a trained XGBClassifier model.
- We create a list of different hyperparameter configurations to train models with a configurable number of threads (n_jobs).
- We define two functions: train_sequential for sequential model training and train_parallel for parallel model training using ThreadPoolExecutor.
- We time the execution of sequential model training with n_jobs=4 for each model, and of parallel model training with n_jobs=2 for each model but 4 models trained at a time.
- We print the execution times and the speedup achieved with parallel training.
The train_parallel function uses ThreadPoolExecutor to distribute the model training workload across multiple threads. The submit function issues a task to the thread pool, and the max_workers parameter specifies the number of worker threads to use.
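Because submit returns a Future object, a small variation of train_parallel can also hand back the fitted models once they finish. The sketch below is not part of the timed example above; it assumes the train_model and get_params functions from the listing, and the helper name train_parallel_collect is made up here for illustration:

from concurrent.futures import ThreadPoolExecutor, as_completed

def train_parallel_collect(param_sets, max_workers=4):
    # Submit one training task per hyperparameter configuration
    with ThreadPoolExecutor(max_workers=max_workers) as p:
        futures = [p.submit(train_model, ps) for ps in param_sets]
        # as_completed yields each future as soon as its model has finished training;
        # calling result() also re-raises any exception from the worker thread
        return [future.result() for future in as_completed(futures)]

# Example usage: train the four configurations in parallel and keep the fitted models
models = train_parallel_collect(get_params(2))

Collecting results this way also surfaces any training errors, which the original train_parallel silently discards along with the futures.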
Vary the number of worker threads used by ThreadPoolExecutor and the number of threads set via n_jobs for the sequential and parallel functions to optimize performance on your system.
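As a starting point for that tuning, the minimal sketch below times a few combinations of pool size and per-model thread count; it reuses train_model, get_params, ThreadPoolExecutor, and time from the listing above, and the specific (max_workers, n_jobs) pairs are illustrative rather than recommendations:

# Time a few (max_workers, n_jobs) combinations to see what suits this machine
for max_workers, n_jobs in [(2, 4), (4, 2), (8, 1)]:
    start = time.perf_counter()
    # The with-block waits for all submitted training tasks before exiting
    with ThreadPoolExecutor(max_workers=max_workers) as p:
        _ = [p.submit(train_model, ps) for ps in get_params(n_jobs)]
    duration = time.perf_counter() - start
    print(f"max_workers={max_workers}, n_jobs={n_jobs}: {duration:.2f} seconds")

A reasonable rule of thumb is to keep max_workers multiplied by n_jobs close to the number of physical CPU cores, then adjust based on the measured times.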