XGBoost Parallel Prediction With a Thread Pool (threading)

When dealing with large datasets, making predictions with your trained XGBoost model can be time-consuming.

Python’s concurrent.futures.ThreadPoolExecutor allows you to distribute the prediction workload across multiple threads, potentially speeding up the process significantly.

The XGBoost model is thread-safe during inference when calling predict(). Third-party C libraries such as NumPy and XGBoost can release the global interpreter lock (GIL) while they run native code, which allows Python threads to execute predictions in true parallel.

XGBoost automatically uses multiple native threads (via OpenMP) behind the scenes when making predictions, and NumPy may do the same through BLAS (Basic Linear Algebra Subprograms). This native threading should be disabled before using Python threads to make parallel predictions, to avoid contention (too many threads competing for the same CPU cores). This can be achieved by setting the 'OMP_NUM_THREADS' environment variable to "1" before importing these libraries, so that each predict() call is single-threaded.
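As a minimal sketch of that configuration step, the snippet below sets the thread-limiting variables in one place, before any numerical library is imported. The helper name limit_native_threads is hypothetical, and the two extra variables are an assumption: OPENBLAS_NUM_THREADS and MKL_NUM_THREADS only matter if your NumPy build is linked against OpenBLAS or Intel MKL.

```python
import os

# Hypothetical helper, not part of XGBoost or NumPy.
def limit_native_threads(n_threads=1):
    """Cap native thread pools via environment variables.

    Call this before importing numpy or xgboost, because the native
    libraries typically read these variables when they are first loaded.
    """
    value = str(n_threads)
    # OMP_NUM_THREADS is the variable used in this article; the other two
    # are assumptions that apply only to OpenBLAS or MKL builds of NumPy.
    for name in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[name] = value
    return value

limit_native_threads(1)
print(os.environ["OMP_NUM_THREADS"])
```

Setting the variables from inside the script keeps the configuration next to the code that depends on it. Exporting them in the shell before launching Python should have the same effect.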

The complete example below demonstrates how to perform parallel predictions with a Python thread pool and compares the execution time against sequential prediction.

import os
# Limit native (OpenMP/BLAS) threads to one; must be set before importing NumPy and XGBoost
os.environ['OMP_NUM_THREADS'] = "1"
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from concurrent.futures import ThreadPoolExecutor
import time

# Generate a large synthetic dataset
X, y = make_classification(n_samples=10000000, n_features=20, random_state=42)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.9, random_state=42)

# Train an XGBoost model
model = XGBClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Function for sequential prediction
def predict_sequential(model, X):
    return model.predict(X)

# Function for parallel prediction
def predict_parallel(model, X, n_jobs):
    # Split the dataset into n_jobs parts
    chunks = np.array_split(X, n_jobs)
    # make predictions for each chunk in a separate thread
    with ThreadPoolExecutor(max_workers=n_jobs) as executor:
        predictions = list(executor.map(model.predict, chunks))
    return np.concatenate(predictions)

# Time the sequential prediction
start_sequential = time.perf_counter()
sequential_predictions = predict_sequential(model, X_test)
end_sequential = time.perf_counter()
# Print the execution time
print(f"Sequential prediction time: {end_sequential - start_sequential:.2f} seconds")

# Time the parallel prediction
start_parallel = time.perf_counter()
parallel_predictions = predict_parallel(model, X_test, n_jobs=4)
end_parallel = time.perf_counter()
# Print the execution time
print(f"Parallel prediction time: {end_parallel - start_parallel:.2f} seconds")

# Print the speedup
speedup = (end_sequential - start_sequential) / (end_parallel - start_parallel)
print(f"Parallel prediction is {speedup:.2f} times faster than sequential prediction")

You may see results that look something like the following:

Sequential prediction time: 15.36 seconds
Parallel prediction time: 4.14 seconds
Parallel prediction is 3.71 times faster than sequential prediction

Here’s what’s happening:

We limit native (OpenMP/BLAS) threading to a single thread via the 'OMP_NUM_THREADS' environment variable, before importing NumPy and XGBoost.
We generate a large synthetic dataset using sklearn.datasets.make_classification.
We split the data into train and test sets, with 90% of the data used for testing.
We train an XGBClassifier on the training data.
We define two functions: predict_sequential for sequential prediction and predict_parallel for parallel prediction using ThreadPoolExecutor.
We time the execution of sequential prediction and parallel prediction.
We print the execution times and the speedup achieved with parallel prediction.

The predict_parallel function uses ThreadPoolExecutor to distribute the prediction workload across multiple threads. The map function applies the model.predict method to each chunk of the X dataset, and the max_workers parameter specifies the number of threads to use.
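The chunk-and-map pattern can be checked in isolation with a small sketch that needs no trained model. Here fake_predict is a hypothetical stand-in for model.predict, and predict_in_chunks mirrors predict_parallel. The point is that executor.map returns results in input order, so concatenating the chunk predictions reproduces the single-call result.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# Hypothetical stand-in for model.predict: label a row 1 if its sum is positive.
def fake_predict(X):
    return (X.sum(axis=1) > 0).astype(int)

def predict_in_chunks(predict_fn, X, n_jobs):
    # array_split tolerates row counts that do not divide evenly by n_jobs
    chunks = np.array_split(X, n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as executor:
        # map yields results in the order of the inputs, not completion order
        parts = list(executor.map(predict_fn, chunks))
    return np.concatenate(parts)

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10, 3))

# 10 rows split 4 ways: the first two chunks get the extra rows
chunk_sizes = [len(c) for c in np.array_split(X_demo, 4)]
print(chunk_sizes)  # [3, 3, 2, 2]

same = np.array_equal(predict_in_chunks(fake_predict, X_demo, 4), fake_predict(X_demo))
print(same)  # True
```

The same equality check, applied to sequential_predictions and parallel_predictions in the main example, is a cheap way to confirm that the parallel path returns identical labels.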

The multithreaded version can be several times faster than the sequential version, depending on the number of CPU cores in your system and the value of n_jobs (e.g. up to about 4x faster with n_jobs=4 on a machine with at least 4 cores).
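Because the achievable speedup is bounded by the cores available, one option is to derive n_jobs from the machine instead of hard-coding it. The sketch below uses a hypothetical helper, choose_n_jobs, and assumes that capping at os.cpu_count() is a reasonable default. Note that os.cpu_count() reports logical CPUs, which may exceed the number of physical cores.

```python
import os

# Hypothetical helper: cap the requested thread count at the number of logical CPUs.
def choose_n_jobs(requested=4):
    available = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(requested, available))

n_jobs = choose_n_jobs(4)
print(f"Using {n_jobs} worker threads")
```

The returned value could then be passed as the n_jobs argument of predict_parallel.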

It is likely that the automatic parallelism built into XGBoost prediction (its own native threads) will be faster than manually splitting the prediction task among Python threads. Test and compare performance on your system.
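For such comparisons, a single timed run can be noisy. The following is a minimal sketch of a timing helper that repeats a call and keeps the fastest run. The name best_time and the choice of three repeats are assumptions for illustration, and the summation workload is only a placeholder.

```python
import time

# Hypothetical helper: run fn several times and keep the fastest wall-clock time.
def best_time(fn, *args, repeats=3):
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return best, result

# Placeholder workload; substitute model.predict or predict_parallel here
elapsed, total = best_time(sum, range(1_000_000))
print(f"Best of 3: {elapsed:.4f} seconds")
```

To compare against XGBoost's built-in threading, the native-thread run would need its own Python process without 'OMP_NUM_THREADS' set to "1", since the variable is read when the libraries load.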
