When it comes to gradient boosting, XGBoost is renowned for its speed.
But how does it compare to another popular implementation, scikit-learn’s HistGradientBoostingClassifier?
Let’s put them head-to-head and find out.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from xgboost import XGBClassifier
import time
# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000000, n_classes=2, random_state=42)
# Initialize the classifiers with comparable hyperparameters
xgb_clf = XGBClassifier(n_estimators=100, tree_method='hist', learning_rate=0.1, max_depth=3, random_state=42)
hgb_clf = HistGradientBoostingClassifier(max_iter=100, learning_rate=0.1, max_depth=3, random_state=42)
# Fit XGBClassifier and measure the training time
start_time = time.perf_counter()
xgb_clf.fit(X, y)
xgb_time = time.perf_counter() - start_time
print(f"XGBClassifier training time: {xgb_time:.2f} seconds")
# Fit HistGradientBoostingClassifier and measure the training time
start_time = time.perf_counter()
hgb_clf.fit(X, y)
hgb_time = time.perf_counter() - start_time
print(f"HistGradientBoostingClassifier training time: {hgb_time:.2f} seconds")
We start by generating a synthetic binary classification dataset with 1,000,000 samples using scikit-learn’s make_classification.
Next, we set up our contenders: XGBClassifier and HistGradientBoostingClassifier. To keep the comparison fair, we use similar hyperparameters for both:
- n_estimators=100 (or max_iter=100 for HistGradientBoostingClassifier): The number of boosting rounds.
- tree_method='hist': To use the histogram method in XGBoost.
- learning_rate=0.1: The learning rate for each boosting round.
- max_depth=3: The maximum depth of each decision tree.
- random_state=42: For reproducibility.
We fit each classifier on the dataset and measure the training time using the time module. The start_time is noted before fitting, and the elapsed time is calculated once fitting completes.
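If you find yourself timing several calls this way, the same pattern can be wrapped in a small context manager. The sketch below is purely illustrative and not part of the original script; the timed helper and its label are hypothetical names, built only on the standard time and contextlib modules.
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Record the start time, hand control back to the caller, then report the elapsed time
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.2f} seconds")

# Hypothetical usage with the classifier defined above:
# with timed("XGBClassifier training time"):
#     xgb_clf.fit(X, y)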
Finally, we print the training times for both classifiers.
Here’s an example of the output you might see:
XGBClassifier training time: 2.37 seconds
HistGradientBoostingClassifier training time: 3.96 seconds
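Speed is only half the story: the comparison is only meaningful if both models reach similar quality under these settings. The snippet below is a quick, optional sanity check that was not part of the original benchmark; it assumes the X, y, xgb_clf, and hgb_clf objects defined above and uses scikit-learn’s train_test_split and accuracy_score.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 20% of the data purely to compare predictive quality
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
for name, clf in [("XGBClassifier", xgb_clf), ("HistGradientBoostingClassifier", hgb_clf)]:
    clf.fit(X_train, y_train)
    print(f"{name} accuracy: {accuracy_score(y_test, clf.predict(X_test)):.4f}")
With comparable hyperparameters, you should expect the two accuracies to be close; if they are not, the speed comparison alone tells you little.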
Below is an updated comparison that repeats each experiment ten times and plots the distributions.
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from xgboost import XGBClassifier
# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000000, n_classes=2, random_state=42)
# Initialize the classifiers with comparable hyperparameters
xgb_clf = XGBClassifier(n_estimators=100, tree_method='hist', learning_rate=0.1, max_depth=3, random_state=42)
hgb_clf = HistGradientBoostingClassifier(max_iter=100, learning_rate=0.1, max_depth=3, random_state=42)
# Lists to store training times
xgb_times = []
gb_times = []
# Run the benchmark 10 times
for i in range(10):
    # Measure training time for XGBClassifier
    start_time = time.perf_counter()
    xgb_clf.fit(X, y)
    xgb_duration = time.perf_counter() - start_time
    xgb_times.append(xgb_duration)
    # Measure training time for HistGradientBoostingClassifier
    start_time = time.perf_counter()
    hgb_clf.fit(X, y)
    gb_duration = time.perf_counter() - start_time
    gb_times.append(gb_duration)
    # Report progress
    print(f'> {i} xgb: {xgb_duration:.3f}, gb: {gb_duration:.3f}')
# Calculate mean and standard deviation of training times
xgb_mean = np.mean(xgb_times)
xgb_std = np.std(xgb_times)
gb_mean = np.mean(gb_times)
gb_std = np.std(gb_times)
# Print mean and standard deviation of training times
print(f"XGBClassifier mean training time: {xgb_mean:.2f} seconds (std: {xgb_std:.2f})")
print(f"HistGradientBoostingClassifier mean training time: {gb_mean:.2f} seconds (std: {gb_std:.2f})")
# Plot the distributions as side-by-side boxplots using matplotlib
plt.figure(figsize=(10, 6))
plt.boxplot([xgb_times, gb_times], labels=['XGBoost', 'HistGradientBoosting'])
plt.ylabel('Training Time (seconds)')
plt.title('Training Time Comparison')
plt.show()
The results may look something like the following:
> 0 xgb: 2.441, gb: 4.252
> 1 xgb: 2.861, gb: 4.010
> 2 xgb: 3.022, gb: 3.986
> 3 xgb: 3.096, gb: 4.073
> 4 xgb: 3.140, gb: 3.915
> 5 xgb: 3.277, gb: 4.236
> 6 xgb: 3.201, gb: 4.038
> 7 xgb: 3.205, gb: 4.295
> 8 xgb: 3.104, gb: 4.095
> 9 xgb: 3.153, gb: 4.013
XGBClassifier mean training time: 3.05 seconds (std: 0.23)
HistGradientBoostingClassifier mean training time: 4.09 seconds (std: 0.12)
The exact times will vary based on your hardware and setup, but one thing is clear: XGBClassifier is blazingly fast, often training in less time than HistGradientBoostingClassifier.
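To put a number on the gap, you can compute the relative speedup from the means measured in the script above; this minimal snippet simply reuses the xgb_mean and gb_mean variables already defined there.
# Relative speedup of XGBoost over HistGradientBoostingClassifier on this benchmark
speedup = gb_mean / xgb_mean
print(f"XGBoost is roughly {speedup:.2f}x faster on this benchmark")
With the example means above (3.05 and 4.09 seconds), that works out to roughly a 1.3x speedup.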
This speed advantage is a key reason why XGBoost is so popular among data scientists. By leveraging its efficiency, you can iterate faster, experiment with more features and hyperparameters, and ultimately build better models in less time. So if you’re looking to give your gradient boosting a speed boost, XGBoost is definitely worth considering.