When it comes to gradient boosting for regression tasks, both XGBoost and LightGBM are popular choices.
But which one trains faster?
Let’s put XGBRegressor and LGBMRegressor to the test and compare their training times on a synthetic dataset.
First, we install the lightgbm library using our preferred Python package manager, such as pip:
pip install lightgbm
We can then attempt to compare the performance of both implementations on the same dataset:
from sklearn.datasets import make_regression
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import time
# Generate a synthetic regression dataset
X, y = make_regression(n_samples=100000, n_features=100, noise=0.1, random_state=42)
# Initialize the regressors with comparable hyperparameters
xgb_reg = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
lgbm_reg = LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42, verbosity=-1)
# Fit XGBRegressor and measure the training time
start_time = time.perf_counter()
xgb_reg.fit(X, y)
xgb_time = time.perf_counter() - start_time
print(f"XGBRegressor training time: {xgb_time:.2f} seconds")
# Fit LGBMRegressor and measure the training time
start_time = time.perf_counter()
lgbm_reg.fit(X, y)
lgbm_time = time.perf_counter() - start_time
print(f"LGBMRegressor training time: {lgbm_time:.2f} seconds")
We begin by generating a synthetic regression dataset with 100,000 samples and 100 features using scikit-learn’s make_regression function. We add some noise to make the problem more realistic.
Next, we initialize our contenders: XGBRegressor and LGBMRegressor. To ensure a fair comparison, we use similar hyperparameters for both:
n_estimators=100: The number of boosting rounds.
learning_rate=0.1: The learning rate for each boosting round.
max_depth=3: The maximum depth of each decision tree.
random_state=42: For reproducibility.
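One detail behind the "similar hyperparameters" claim is that max_depth=3 bounds the number of leaves per tree in both libraries. The sketch below works through that arithmetic. It assumes binary trees and LightGBM's documented default of num_leaves=31, which the script above does not set explicitly; the helper name max_leaves_for_depth is made up for illustration.

```python
def max_leaves_for_depth(max_depth: int) -> int:
    """Upper bound on the number of leaves in a binary tree of this depth."""
    return 2 ** max_depth


MAX_DEPTH = 3                     # value passed to both regressors above
LIGHTGBM_DEFAULT_NUM_LEAVES = 31  # assumption: library default, not set in the script

# The tighter of the two limits is the one that actually constrains tree size.
depth_cap = max_leaves_for_depth(MAX_DEPTH)
effective_cap = min(depth_cap, LIGHTGBM_DEFAULT_NUM_LEAVES)

print(f"depth cap: {depth_cap}, effective LightGBM leaf cap: {effective_cap}")
```

Because the depth cap of 8 leaves is below 31, both models are limited to trees of similar size here. With a larger max_depth, num_leaves would start to matter for LightGBM and should probably be set explicitly to keep the comparison fair.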
We then fit each regressor on the dataset and measure the training time using the time module. The start_time is recorded before fitting, and the elapsed time is calculated once fitting is complete.
Finally, we print the training times for both regressors.
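The start/stop pattern described above can be wrapped in a small helper so the same measurement is applied to every model. This is a minimal sketch, not part of either library: time_call and dummy_fit are hypothetical names, and dummy_fit is only a stand-in workload so the snippet runs without xgboost or lightgbm installed.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def dummy_fit(n):
    """Stand-in for a model's fit method: a CPU-bound loop."""
    return sum(i * i for i in range(n))


result, seconds = time_call(dummy_fit, 100_000)
print(f"dummy_fit took {seconds:.4f} seconds")
```

With the regressors defined earlier, the equivalent call would be something like `_, xgb_time = time_call(xgb_reg.fit, X, y)`.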
Here’s an example of the output you might see:
XGBRegressor training time: 1.17 seconds
LGBMRegressor training time: 1.18 seconds
A single run is a noisy measurement, so below is an updated comparison that repeats each experiment 10 times and plots the distributions of training times.
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
# Generate a synthetic regression dataset
X, y = make_regression(n_samples=100000, n_features=100, noise=0.1, random_state=42)
# Initialize the regressors with comparable hyperparameters
xgb_reg = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
lgbm_reg = LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42, verbosity=-1)
# Lists to store training times
xgb_times = []
gb_times = []
# Run the benchmark 10 times
for i in range(10):
    # Measure training time for XGBRegressor
    start_time = time.perf_counter()
    xgb_reg.fit(X, y)
    xgb_duration = time.perf_counter() - start_time
    xgb_times.append(xgb_duration)
    # Measure training time for LGBMRegressor
    start_time = time.perf_counter()
    lgbm_reg.fit(X, y)
    gb_duration = time.perf_counter() - start_time
    gb_times.append(gb_duration)
    # Report progress
    print(f'> {i} xgb: {xgb_duration:.3f}, gb: {gb_duration:.3f}')
# Calculate mean and standard deviation of training times
xgb_mean = np.mean(xgb_times)
xgb_std = np.std(xgb_times)
gb_mean = np.mean(gb_times)
gb_std = np.std(gb_times)
# Print mean and standard deviation of training times
print(f"XGBRegressor mean training time: {xgb_mean:.2f} seconds (std: {xgb_std:.2f})")
print(f"LGBMRegressor mean training time: {gb_mean:.2f} seconds (std: {gb_std:.2f})")
# Plot the distributions as side-by-side boxplots using matplotlib
plt.figure(figsize=(10, 6))
plt.boxplot([xgb_times, gb_times], labels=['XGBRegressor', 'LGBMRegressor'])
plt.ylabel('Training Time (seconds)')
plt.title('Training Time Comparison')
plt.show()
The results may look something like the following:
> 0 xgb: 1.134, gb: 1.225
> 1 xgb: 1.162, gb: 1.165
> 2 xgb: 1.254, gb: 1.162
> 3 xgb: 1.406, gb: 1.215
> 4 xgb: 1.466, gb: 1.214
> 5 xgb: 1.363, gb: 1.233
> 6 xgb: 1.709, gb: 1.267
> 7 xgb: 1.546, gb: 1.235
> 8 xgb: 1.803, gb: 1.210
> 9 xgb: 1.550, gb: 1.218
XGBRegressor mean training time: 1.44 seconds (std: 0.21)
LGBMRegressor mean training time: 1.21 seconds (std: 0.03)
The exact times will depend on your hardware and setup. In this run, XGBRegressor trains slightly slower than LGBMRegressor on average and varies more from run to run, although the two distributions overlap.
In practice, this means LightGBM models may train somewhat more quickly on a dataset like this one, allowing you to iterate faster and potentially build better models in less time. So if training speed is a priority for your regression tasks, LGBMRegressor is a strong choice, though the gap is small enough that it is worth benchmarking on your own data.
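To make the "slightly slower, but overlapping" reading of the sample output concrete, the sketch below recomputes the summary statistics from the ten timings printed above. The numbers are copied from that sample output, not newly measured. The summarize helper is a hypothetical name, and statistics.pstdev is used because it matches NumPy's default population standard deviation.

```python
import statistics

# Timings in seconds, copied from the sample benchmark output above.
xgb_times = [1.134, 1.162, 1.254, 1.406, 1.466, 1.363, 1.709, 1.546, 1.803, 1.550]
lgbm_times = [1.225, 1.165, 1.162, 1.215, 1.214, 1.233, 1.267, 1.235, 1.210, 1.218]


def summarize(times):
    """Return basic summary statistics for a list of timings."""
    return {
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "std": statistics.pstdev(times),  # population std, like np.std
        "min": min(times),
        "max": max(times),
    }


xgb = summarize(xgb_times)
lgbm = summarize(lgbm_times)

# Two ranges overlap when each one starts before the other ends.
ranges_overlap = xgb["min"] <= lgbm["max"] and lgbm["min"] <= xgb["max"]

# Ratio above 1 means XGBRegressor took longer on average in this sample.
mean_ratio = xgb["mean"] / lgbm["mean"]

print(f"XGBRegressor:  mean={xgb['mean']:.2f} median={xgb['median']:.2f} std={xgb['std']:.2f}")
print(f"LGBMRegressor: mean={lgbm['mean']:.2f} median={lgbm['median']:.2f} std={lgbm['std']:.2f}")
print(f"ranges overlap: {ranges_overlap}, mean ratio (xgb/lgbm): {mean_ratio:.2f}")
```

For these sample numbers the ranges do overlap (the fastest XGBRegressor run beats every LGBMRegressor run), while the XGBRegressor mean is roughly 19% higher. Ten runs on one machine is a small sample, so treat the ratio as indicative rather than conclusive.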