Python’s Global Interpreter Lock (GIL) is a mechanism that prevents multiple native threads from executing Python bytecodes simultaneously, which can limit parallelism.
However, XGBoost is designed to release the GIL when performing computationally intensive tasks in its native code, allowing it to efficiently utilize multiple threads for training and inference even when called from Python.
This example demonstrates the speedup achieved by training XGBoost models in parallel using Python's `ThreadPoolExecutor`.
By leveraging XGBoost’s ability to release the GIL, we can significantly reduce the training time compared to sequential training.
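To see why releasing the GIL matters, consider this minimal, self-contained sketch (a contrast case, not part of the XGBoost examples below): a CPU-bound, pure-Python task gains essentially nothing from threads, because only one thread can execute Python bytecode at a time.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def busy_work(n):
    # Pure-Python loop: holds the GIL the entire time
    total = 0
    for i in range(n):
        total += i * i
    return total

# Run the same work sequentially
start = time.perf_counter()
for _ in range(4):
    busy_work(2_000_000)
print(f"Sequential: {time.perf_counter() - start:.2f} seconds")

# Run it again with 4 threads: expect roughly the same wall time, since the GIL
# serializes pure-Python bytecode
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    list(executor.map(busy_work, [2_000_000] * 4))
print(f"Threaded:   {time.perf_counter() - start:.2f} seconds")
```

XGBoost's native training and prediction routines are the opposite case: they drop the GIL while they run, so the threaded versions in the examples below genuinely overlap work across cores.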
XGBoost Releases the GIL During Training
```python
import os
os.environ['OMP_NUM_THREADS'] = '1'  # Avoid contention between threads

import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from concurrent.futures import ThreadPoolExecutor
import time

# Generate a synthetic dataset for a classification problem
X, y = make_classification(n_samples=100000, n_features=20, random_state=42)

# Define a function to train a single XGBoost model
def train_model(params):
    model = XGBClassifier(**params)
    model.fit(X, y)

# Create a list of different hyperparameter configurations
param_sets = [
    {'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.1, 'n_jobs': 1},
    {'n_estimators': 200, 'max_depth': 4, 'learning_rate': 0.05, 'n_jobs': 1},
    {'n_estimators': 150, 'max_depth': 5, 'learning_rate': 0.08, 'n_jobs': 1},
    {'n_estimators': 180, 'max_depth': 3, 'learning_rate': 0.12, 'n_jobs': 1},
]

# Train models sequentially
def train_sequential():
    for params in param_sets:
        train_model(params)

# Train models in parallel using ThreadPoolExecutor
def train_parallel():
    with ThreadPoolExecutor(max_workers=4) as executor:
        executor.map(train_model, param_sets)

# Time the sequential training
start_sequential = time.perf_counter()
train_sequential()
end_sequential = time.perf_counter()
print(f"Sequential training time: {end_sequential - start_sequential:.2f} seconds")

# Time the parallel training
start_parallel = time.perf_counter()
train_parallel()
end_parallel = time.perf_counter()
print(f"Parallel training time: {end_parallel - start_parallel:.2f} seconds")

# Print the speedup achieved with parallel training
speedup = (end_sequential - start_sequential) / (end_parallel - start_parallel)
print(f"Parallel training is {speedup:.2f} times faster than sequential training")
```
In this code:
- We set `OMP_NUM_THREADS` to 1 so that each model uses a single OpenMP thread, avoiding contention between the worker threads.
- We generate a synthetic dataset for a classification problem using `make_classification` from scikit-learn.
- We define `train_model`, a function that takes hyperparameters and trains an XGBoost model.
- We create `param_sets`, a list of hyperparameter configurations to use for training.
- We define `train_sequential` to train the models one after the other.
- We define `train_parallel` to train the models concurrently using `ThreadPoolExecutor`.
- We time the sequential and parallel training and print the execution times.
- We print the speedup achieved with parallel training.
The key takeaway is that parallel training can be significantly faster than sequential training, even though we’re using Python threads. This is possible because XGBoost releases the GIL when doing the heavy lifting in its native code, allowing true parallelism.
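If you also want to keep the fitted models rather than just measure the timing, a small variation (a sketch, assuming the same imports, `X`, `y`, and `param_sets` as the example above) is to have `train_model` return the model and collect the results from `executor.map()`:

```python
# Variation of train_model that returns the fitted model
# (assumes X, y, param_sets, XGBClassifier, and ThreadPoolExecutor from above)
def train_model(params):
    model = XGBClassifier(**params)
    model.fit(X, y)
    return model

with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map() preserves input order, so models[i] was trained with param_sets[i]
    models = list(executor.map(train_model, param_sets))
```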
XGBoost Releases the GIL During Inference
```python
import os
os.environ['OMP_NUM_THREADS'] = '1'  # Avoid contention between threads

import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from concurrent.futures import ThreadPoolExecutor
import time

# Generate a synthetic dataset for a classification problem
X, y = make_classification(n_samples=1000000, n_features=20, random_state=42)

# Train multiple XGBoost models with different hyperparameters
models = [
    XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1, n_jobs=1).fit(X, y),
    XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.05, n_jobs=1).fit(X, y),
    XGBClassifier(n_estimators=150, max_depth=5, learning_rate=0.08, n_jobs=1).fit(X, y),
    XGBClassifier(n_estimators=180, max_depth=3, learning_rate=0.12, n_jobs=1).fit(X, y),
]

# Define a function to make predictions with a single model
def predict_with_model(model):
    model.predict(X)

# Make predictions sequentially
def predict_sequential():
    for model in models:
        predict_with_model(model)

# Make predictions in parallel using ThreadPoolExecutor
def predict_parallel():
    with ThreadPoolExecutor(max_workers=4) as executor:
        executor.map(predict_with_model, models)

# Time the sequential predictions
start_sequential = time.perf_counter()
predict_sequential()
end_sequential = time.perf_counter()
print(f"Sequential prediction time: {end_sequential - start_sequential:.2f} seconds")

# Time the parallel predictions
start_parallel = time.perf_counter()
predict_parallel()
end_parallel = time.perf_counter()
print(f"Parallel prediction time: {end_parallel - start_parallel:.2f} seconds")

# Print the speedup achieved with parallel prediction
speedup = (end_sequential - start_sequential) / (end_parallel - start_parallel)
print(f"Parallel prediction is {speedup:.2f} times faster than sequential prediction")
```
Running this code, you might see output similar to:
```
Sequential prediction time: 6.17 seconds
Parallel prediction time: 2.17 seconds
Parallel prediction is 2.85 times faster than sequential prediction
```
Here’s what’s happening in the code:
- We set `OMP_NUM_THREADS` to 1 so that each model uses a single OpenMP thread, avoiding contention between the worker threads.
- We generate a synthetic dataset using `make_classification` from scikit-learn.
- We train multiple XGBoost models with different hyperparameters.
- We define `predict_with_model`, a function that takes a model and makes predictions on the dataset.
- We define `predict_sequential` to make predictions with each model sequentially.
- We define `predict_parallel` to make predictions with the models concurrently using `ThreadPoolExecutor`.
- We time the sequential and parallel predictions and print the execution times.
- We print the speedup achieved with parallel prediction.
The key observation is that parallel prediction is significantly faster than sequential prediction. This speedup is possible because XGBoost releases the GIL when executing the computationally intensive prediction code in its native backend, allowing the Python threads to run concurrently.
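If you need the prediction results themselves rather than just the timing, one alternative pattern is `submit()` plus `as_completed()`, which hands you each model's predictions as soon as that model finishes instead of in submission order. A minimal sketch, reusing `models` and `X` from the example above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Reuses models and X from the example above; collects results as they finish
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(model.predict, X): i for i, model in enumerate(models)}
    for future in as_completed(futures):
        i = futures[future]
        preds = future.result()
        print(f"Model {i} finished: {preds.shape[0]} predictions")
```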
Keep in mind that the specific speedup will vary depending on factors like your CPU, the size of your dataset, and the hyperparameters used. Nonetheless, this example showcases XGBoost’s ability to efficiently leverage multi-threading even in a Python context, thanks to its smart handling of the GIL.
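As a rough rule of thumb (an assumption on my part, not an XGBoost requirement): with `OMP_NUM_THREADS=1`, each worker thread drives roughly one core, so sizing the pool to the smaller of the number of models and the number of available cores is a reasonable starting point. A sketch, reusing `models`, `predict_with_model`, and the imports from the example above:

```python
import os

# Cap the pool at the number of models or available cores, whichever is smaller
max_workers = min(len(models), os.cpu_count() or 1)

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    executor.map(predict_with_model, models)
```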