When it comes to gradient boosting, both XGBoost and LightGBM are popular choices known for their speed and efficiency.
But in a head-to-head comparison, which one comes out on top?
Let’s put them to the test and find out.
First, we need to install the lightgbm library using your preferred Python package manager, such as pip:
pip install lightgbm
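If you want to confirm the installation, a quick optional check is to print the installed version:
python -c "import lightgbm; print(lightgbm.__version__)"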
We can then attempt to compare the performance of both implementations on the same dataset:
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import time
# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000000, n_classes=2, random_state=42)
# Initialize the classifiers with comparable hyperparameters
xgb_clf = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
lgbm_clf = LGBMClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42, verbosity=-1)
# Fit XGBClassifier and measure the training time
start_time = time.perf_counter()
xgb_clf.fit(X, y)
xgb_time = time.perf_counter() - start_time
print(f"XGBClassifier training time: {xgb_time:.2f} seconds")
# Fit LGBMClassifier and measure the training time
start_time = time.perf_counter()
lgbm_clf.fit(X, y)
lgbm_time = time.perf_counter() - start_time
print(f"LGBMClassifier training time: {lgbm_time:.2f} seconds")
We start by generating a large synthetic binary classification dataset with 1,000,000 samples using scikit-learn's make_classification. This ensures we have enough data for a noticeable difference in fitting times to emerge.
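As an optional sanity check, reusing the X and y generated above, you can inspect the shape and class balance of the synthetic data; the comments reflect make_classification's defaults:
import numpy as np
print(X.shape)         # (1000000, 20), since make_classification defaults to 20 features
print(np.bincount(y))  # roughly balanced class counts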
Next, we set up our contenders: XGBClassifier and LGBMClassifier. To keep the comparison fair, we use the same hyperparameters for both:
n_estimators=100: The number of boosting rounds.
learning_rate=0.1: The learning rate for each boosting round.
max_depth=3: The maximum depth of each decision tree.
random_state=42: For reproducibility.
We also pass verbosity=-1 to LGBMClassifier, which only silences its log output and does not affect training.
We fit each classifier on the dataset and measure the training time with time.perf_counter() from the time module: start_time is recorded just before fitting, and the elapsed time is calculated once fitting completes.
Finally, we print the training times for both classifiers.
Here’s an example of the output you might see:
XGBClassifier training time: 2.44 seconds
LGBMClassifier training time: 2.00 seconds
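If you find yourself repeating this pattern, the timing boilerplate can be factored into a small helper. This is just a sketch; the benchmark below keeps the inline timing from the original example:
import time

def time_fit(model, X, y):
    # Fit the model and return the elapsed wall-clock time in seconds
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start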
Below is an updated comparison that repeats each experiment ten times and plots the distributions of training times.
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000000, n_classes=2, random_state=42)
# Initialize the classifiers with comparable hyperparameters
xgb_clf = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
lgbm_clf = LGBMClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42, verbosity=-1)
# Lists to store training times
xgb_times = []
gb_times = []
# Run the benchmark 10 times
for i in range(10):
    # Measure training time for XGBClassifier
    start_time = time.perf_counter()
    xgb_clf.fit(X, y)
    xgb_duration = time.perf_counter() - start_time
    xgb_times.append(xgb_duration)
    # Measure training time for LGBMClassifier
    start_time = time.perf_counter()
    lgbm_clf.fit(X, y)
    gb_duration = time.perf_counter() - start_time
    gb_times.append(gb_duration)
    # Report progress
    print(f'> {i} xgb: {xgb_duration:.3f}, gb: {gb_duration:.3f}')
# Calculate mean and standard deviation of training times
xgb_mean = np.mean(xgb_times)
xgb_std = np.std(xgb_times)
gb_mean = np.mean(gb_times)
gb_std = np.std(gb_times)
# Print mean and standard deviation of training times
print(f"XGBoostClassifier mean training time: {xgb_mean:.2f} seconds (std: {xgb_std:.2f})")
print(f"LGBMClassifier mean training time: {gb_mean:.2f} seconds (std: {gb_std:.2f})")
# Plot the distributions as side-by-side boxplots using matplotlib
plt.figure(figsize=(10, 6))
plt.boxplot([xgb_times, gb_times], labels=['XGBoost', 'LGBMClassifier'])
plt.ylabel('Training Time (seconds)')
plt.title('Training Time Comparison')
plt.show()
The results may look something like the following:
> 0 xgb: 2.437, gb: 2.230
> 1 xgb: 2.745, gb: 1.997
> 2 xgb: 2.616, gb: 2.060
> 3 xgb: 3.018, gb: 2.096
> 4 xgb: 2.835, gb: 2.087
> 5 xgb: 3.093, gb: 2.286
> 6 xgb: 3.240, gb: 2.310
> 7 xgb: 3.198, gb: 2.165
> 8 xgb: 3.294, gb: 2.407
> 9 xgb: 3.508, gb: 2.166
XGBoostClassifier mean training time: 3.00 seconds (std: 0.32)
LGBMClassifier mean training time: 2.18 seconds (std: 0.12)
The exact times will vary based on your hardware and setup, but in this run XGBClassifier consistently trails LGBMClassifier, taking roughly 1.4x as long to train on average (3.00 vs. 2.18 seconds).
Of course, speed isn't everything. LightGBM brings other strengths beyond raw training speed, such as the ability to handle categorical features directly, and the best choice will depend on your specific dataset and requirements.
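As a brief illustration of the categorical-feature point, LightGBM treats pandas columns with the category dtype as categorical features by default (categorical_feature='auto'), so no one-hot encoding is required. The toy dataset below is made up purely for demonstration:
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
# Toy dataset: one numeric column and one pandas categorical column
rng = np.random.default_rng(42)
X_df = pd.DataFrame({
    "amount": rng.normal(size=1000),
    "colour": pd.Categorical(rng.choice(["red", "green", "blue"], size=1000)),
})
y = (X_df["amount"] > 0).astype(int)
# The categorical column is consumed directly, without manual encoding
clf = LGBMClassifier(n_estimators=50, verbosity=-1, random_state=42)
clf.fit(X_df, y)
print(clf.predict(X_df.head()))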