XGBRegressor Faster Than LGBMRegressor

When it comes to gradient boosting for regression tasks, both XGBoost and LightGBM are popular choices.

But which one trains faster?

Let’s put XGBRegressor and LGBMRegressor to the test and compare their training times on a synthetic dataset.

First, install the xgboost and lightgbm libraries using your preferred Python package manager, such as pip:

pip install xgboost lightgbm

We can then compare the training time of both implementations on the same dataset:

from sklearn.datasets import make_regression
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import time

# Generate a synthetic regression dataset
X, y = make_regression(n_samples=100000, n_features=100, noise=0.1, random_state=42)

# Initialize the regressors with comparable hyperparameters
xgb_reg = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
lgbm_reg = LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42, verbosity=-1)

# Fit XGBRegressor and measure the training time
start_time = time.perf_counter()
xgb_reg.fit(X, y)
xgb_time = time.perf_counter() - start_time
print(f"XGBRegressor training time: {xgb_time:.2f} seconds")

# Fit LGBMRegressor and measure the training time
start_time = time.perf_counter()
lgbm_reg.fit(X, y)
lgbm_time = time.perf_counter() - start_time
print(f"LGBMRegressor training time: {lgbm_time:.2f} seconds")

We begin by generating a synthetic regression dataset with 100,000 samples and 100 features using scikit-learn’s make_regression function. We add a small amount of Gaussian noise (noise=0.1) to make the problem more realistic.

Next, we initialize our contenders: XGBRegressor and LGBMRegressor. To ensure a fair comparison, we use similar hyperparameters for both:

n_estimators=100: The number of boosting rounds.
learning_rate=0.1: The shrinkage factor applied to each tree's contribution.
max_depth=3: The maximum depth of each decision tree.
random_state=42: For reproducibility.
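One way to keep the two configurations from drifting apart is to hold the shared settings in a single dictionary and add library-specific extras on top. The following is a minimal sketch; the names shared_params, xgb_params, lgbm_params, and differences are illustrative and do not appear in the listing above.

```python
# Hyperparameters shared by both regressors (values taken from the listing above).
shared_params = {
    "n_estimators": 100,
    "learning_rate": 0.1,
    "max_depth": 3,
    "random_state": 42,
}

# Library-specific extras: verbosity=-1 is only passed to LGBMRegressor
# in the listing above, where it silences LightGBM's log output.
xgb_params = {**shared_params}
lgbm_params = {**shared_params, "verbosity": -1}

# Show which settings differ between the two configurations.
differences = {k: v for k, v in lgbm_params.items() if xgb_params.get(k) != v}
print(differences)
```

The dictionaries can then be unpacked into the constructors, for example XGBRegressor(**xgb_params) and LGBMRegressor(**lgbm_params), so any change to a shared setting applies to both models at once.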

We then fit each regressor on the dataset and measure the training time using time.perf_counter() from the time module. The start_time is recorded immediately before fitting, and the elapsed time is calculated as soon as fitting is complete.
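The timing pattern described here can be wrapped in a small helper so that the same measurement code is applied to every model. This is a sketch only: time_call and dummy_workload are hypothetical names introduced for illustration, and the dummy workload stands in for a call to fit so the example runs without either library installed.

```python
import time


def time_call(fn, repeats=1):
    """Call fn() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # high-resolution monotonic clock
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def dummy_workload():
    # Stand-in for model.fit(X, y); any zero-argument callable works.
    return sum(i * i for i in range(100_000))


durations = time_call(dummy_workload, repeats=3)
print([f"{d:.4f}" for d in durations])
```

With the regressors from the listing, the call would look like time_call(lambda: xgb_reg.fit(X, y)), which returns a list holding one duration.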

Finally, we print the training times for both regressors.

Here’s an example of the output you might see:

XGBRegressor training time: 1.17 seconds
LGBMRegressor training time: 1.18 seconds

A single run is a noisy measurement, so below is an updated comparison that repeats each experiment 10 times and plots the distributions of the training times.

import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Generate a synthetic regression dataset
X, y = make_regression(n_samples=100000, n_features=100, noise=0.1, random_state=42)

# Initialize the regressors with comparable hyperparameters
xgb_reg = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
lgbm_reg = LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42, verbosity=-1)

# Lists to store training times
xgb_times = []
gb_times = []

# Run the benchmark 10 times
for i in range(10):
    # Measure training time for XGBRegressor
    start_time = time.perf_counter()
    xgb_reg.fit(X, y)
    xgb_duration = time.perf_counter() - start_time
    xgb_times.append(xgb_duration)

    # Measure training time for LGBMRegressor
    start_time = time.perf_counter()
    lgbm_reg.fit(X, y)
    gb_duration = time.perf_counter() - start_time
    gb_times.append(gb_duration)

    # Report progress
    print(f'> {i} xgb: {xgb_duration:.3f}, gb: {gb_duration:.3f}')

# Calculate mean and standard deviation of training times
xgb_mean = np.mean(xgb_times)
xgb_std = np.std(xgb_times)
gb_mean = np.mean(gb_times)
gb_std = np.std(gb_times)

# Print mean and standard deviation of training times
print(f"XGBRegressor mean training time: {xgb_mean:.2f} seconds (std: {xgb_std:.2f})")
print(f"LGBMRegressor mean training time: {gb_mean:.2f} seconds (std: {gb_std:.2f})")

# Plot the distributions as side-by-side boxplots using matplotlib
plt.figure(figsize=(10, 6))
# Note: Matplotlib 3.9+ renames the labels argument to tick_labels
plt.boxplot([xgb_times, gb_times], labels=['XGBRegressor', 'LGBMRegressor'])
plt.ylabel('Training Time (seconds)')
plt.title('Training Time Comparison')
plt.show()

The results may look something like the following:

> 0 xgb: 1.134, gb: 1.225
> 1 xgb: 1.162, gb: 1.165
> 2 xgb: 1.254, gb: 1.162
> 3 xgb: 1.406, gb: 1.215
> 4 xgb: 1.466, gb: 1.214
> 5 xgb: 1.363, gb: 1.233
> 6 xgb: 1.709, gb: 1.267
> 7 xgb: 1.546, gb: 1.235
> 8 xgb: 1.803, gb: 1.210
> 9 xgb: 1.550, gb: 1.218
XGBRegressor mean training time: 1.44 seconds (std: 0.21)
LGBMRegressor mean training time: 1.21 seconds (std: 0.03)

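The exact times will depend on your hardware and setup. In this run, XGBRegressor trained slightly slower on average than LGBMRegressor (1.44 versus 1.21 seconds) and its times varied more from run to run (standard deviation of 0.21 versus 0.03), although the two distributions overlap.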
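To make the comparison concrete, the per-run times from the sample output above can be summarized directly. The sketch below uses only the standard library; summarize, ranges_overlap, and mean_ratio are illustrative names, and the min/max overlap check is a rough heuristic rather than a statistical test.

```python
from statistics import mean, median, pstdev

# Per-run training times (seconds) copied from the sample output above.
xgb_times = [1.134, 1.162, 1.254, 1.406, 1.466, 1.363, 1.709, 1.546, 1.803, 1.550]
gb_times = [1.225, 1.165, 1.162, 1.215, 1.214, 1.233, 1.267, 1.235, 1.210, 1.218]


def summarize(times):
    """Return basic descriptive statistics for a list of durations."""
    return {
        "mean": mean(times),
        "median": median(times),
        "std": pstdev(times),  # population std, the same default as np.std
        "min": min(times),
        "max": max(times),
    }


xgb_summary = summarize(xgb_times)
gb_summary = summarize(gb_times)

# Rough overlap check: do the observed [min, max] ranges intersect?
ranges_overlap = (
    xgb_summary["min"] <= gb_summary["max"]
    and gb_summary["min"] <= xgb_summary["max"]
)

# Ratio of mean training times (values above 1 mean XGBRegressor was slower).
mean_ratio = xgb_summary["mean"] / gb_summary["mean"]

print(f"XGBRegressor:  {xgb_summary}")
print(f"LGBMRegressor: {gb_summary}")
print(f"Ranges overlap: {ranges_overlap}, mean ratio: {mean_ratio:.2f}")
```

Reporting the median alongside the mean is a deliberate choice here: with only 10 runs, one or two slow iterations can pull the mean noticeably.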

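In practice, the gap is small at this scale: either library trains a 100-tree model on 100,000 samples in a second or two, so you can iterate quickly with both. XGBRegressor remains a strong choice for regression tasks where training speed is a priority, but if speed matters for your workload, repeat the benchmark on your own data and hardware before committing to one library.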
