
XGBoost Configure "n_jobs" for Random Search

When using XGBoost with RandomizedSearchCV for hyperparameter tuning, the n_jobs parameter can be leveraged to parallelize computations across multiple CPU cores, potentially speeding up the process.

However, finding the optimal configuration for n_jobs in both XGBoost and RandomizedSearchCV requires some experimentation, as the available resources need to be divided between model training and the search itself.

This example demonstrates how to set n_jobs for both XGBoost and RandomizedSearchCV, and compares the execution times of different configurations to help find the most efficient setup for your specific use case.

# XGBoosting.com
# XGBoost Set n_jobs for RandomizedSearchCV
import os
# Set environment variable to limit OpenMP to 1 thread
os.environ["OMP_NUM_THREADS"] = "1"
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from xgboost import XGBClassifier
import time
import multiprocessing

# Get the number of available CPU cores
n_cores = multiprocessing.cpu_count()

# Generate a synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, random_state=42)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter distribution for randomized search
param_dist = {
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.1, 0.01, 0.001],
    'subsample': [0.8, 1.0]
}

# Function to train the model and perform randomized search
def train_and_tune(model_n_jobs, random_search_n_jobs):
    model = XGBClassifier(n_estimators=100, n_jobs=model_n_jobs, random_state=42)
    # Fix the search seed so every configuration samples the same candidate parameters
    random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=10, cv=3, n_jobs=random_search_n_jobs, random_state=42)

    start_time = time.perf_counter()
    random_search.fit(X_train, y_train)
    end_time = time.perf_counter()

    return end_time - start_time

# Compare different n_jobs configurations
configurations = [
    (1, 1),
    (n_cores, n_cores),
    (n_cores//2, n_cores//2),
    (1, n_cores),
    (1, n_cores//2),
    (n_cores, 1),
    (n_cores//2, 1)
]

for model_n_jobs, random_search_n_jobs in configurations:
    execution_time = train_and_tune(model_n_jobs, random_search_n_jobs)
    print(f"Model n_jobs: {model_n_jobs}, Random Search n_jobs: {random_search_n_jobs}, Execution Time: {execution_time:.2f} seconds")

You may see results like the following (exact times will vary with your hardware):

Model n_jobs: 1, Random Search n_jobs: 1, Execution Time: 11.05 seconds
Model n_jobs: 8, Random Search n_jobs: 8, Execution Time: 11.48 seconds
Model n_jobs: 4, Random Search n_jobs: 4, Execution Time: 7.37 seconds
Model n_jobs: 1, Random Search n_jobs: 8, Execution Time: 5.71 seconds
Model n_jobs: 1, Random Search n_jobs: 4, Execution Time: 6.03 seconds
Model n_jobs: 8, Random Search n_jobs: 1, Execution Time: 7.26 seconds
Model n_jobs: 4, Random Search n_jobs: 1, Execution Time: 5.20 seconds

In this example, we:

  1. Set the OMP_NUM_THREADS environment variable to "1" to limit OpenMP to a single thread, so that thread counts are controlled explicitly through the n_jobs parameters rather than by OpenMP's defaults.
  2. Generate a synthetic dataset using sklearn.datasets.make_classification.
  3. Define an XGBClassifier with n_estimators set to 100 and random_state set to 42 for reproducibility.
  4. Define a parameter distribution for randomized search with different values for max_depth, learning_rate, and subsample.
  5. Create a function train_and_tune that takes model_n_jobs and random_search_n_jobs as arguments, trains the model, performs randomized search, and returns the execution time.
  6. Compare different n_jobs configurations by calling train_and_tune with various combinations of model_n_jobs and random_search_n_jobs.
  7. Print the execution time for each configuration.

Based on the results, you can determine which configuration of n_jobs works best for your system and dataset. Keep in mind that RandomizedSearchCV runs up to random_search_n_jobs fits at the same time and each fit uses up to model_n_jobs threads inside XGBoost, so the total thread demand is roughly the product of the two values; configurations whose product far exceeds the number of physical cores tend to slow down due to oversubscription.
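
As a starting point, one option is to keep that product at or below the core count. The sketch below is a minimal illustration of this idea, reusing the train_and_tune function and n_cores from the example above; the specific search-level settings it tries are arbitrary choices, not a recommendation.

# Pair each search-level n_jobs with a per-model thread count so that
# model_n_jobs * random_search_n_jobs stays at or below n_cores
balanced_configs = sorted({(max(1, n_cores // s), s) for s in (1, 2, n_cores // 2, n_cores) if s >= 1})

timings = {}
for model_n_jobs, random_search_n_jobs in balanced_configs:
    timings[(model_n_jobs, random_search_n_jobs)] = train_and_tune(model_n_jobs, random_search_n_jobs)

# Report the fastest pairing found on this machine
best = min(timings, key=timings.get)
print(f"Fastest configuration: model n_jobs={best[0]}, search n_jobs={best[1]} ({timings[best]:.2f} seconds)")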

Experiment with different configurations to find the optimal balance between parallelizing model training and the randomized search itself. Keep in mind that the ideal setup may vary depending on the size of your dataset, the complexity of your model, and the available computational resources.
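
Once you have settled on an n_jobs configuration, you will usually want the outcome of the search itself rather than just its timing. The following sketch reuses param_dist and the training data from the example above and plugs in one illustrative configuration (per-model n_jobs of 1, search-level n_jobs of n_cores); substitute whichever values proved fastest on your machine.

# Run the search once with the chosen n_jobs configuration
model = XGBClassifier(n_estimators=100, n_jobs=1, random_state=42)
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=10, cv=3, n_jobs=n_cores, random_state=42)
random_search.fit(X_train, y_train)

# Inspect the best hyperparameters and cross-validated accuracy
print("Best parameters:", random_search.best_params_)
print("Best CV accuracy:", random_search.best_score_)

# Evaluate the refit best estimator on the held-out test set
print("Test accuracy:", random_search.best_estimator_.score(X_test, y_test))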


