
XGBRegressor faster than CatBoostRegressor

When it comes to gradient boosting for regression tasks, both XGBoost and CatBoost are popular choices known for their strong performance and efficiency.

But which one trains faster?

Let’s put them head-to-head and find out.

First, ensure you have the catboost library installed (the examples below also rely on xgboost, scikit-learn, and matplotlib). If not, you can install it using pip:

pip install catboost

Now, let’s set up our speed test:

from sklearn.datasets import make_regression
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
import time

# Generate a synthetic regression dataset
X, y = make_regression(n_samples=100000, n_features=10, noise=0.1, random_state=42)

# Initialize the regressors with comparable hyperparameters
xgb_reg = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
cb_reg = CatBoostRegressor(iterations=100, learning_rate=0.1, max_depth=3, random_state=42, verbose=False)

# Fit XGBRegressor and measure the training time
start_time = time.perf_counter()
xgb_reg.fit(X, y)
xgb_time = time.perf_counter() - start_time
print(f"XGBRegressor training time: {xgb_time:.2f} seconds")

# Fit CatBoostRegressor and measure the training time
start_time = time.perf_counter()
cb_reg.fit(X, y)
cb_time = time.perf_counter() - start_time
print(f"CatBoostRegressor training time: {cb_time:.2f} seconds")

We begin by generating a large synthetic regression dataset with 100,000 samples and 10 features using scikit-learn’s make_regression. This provides ample data to observe a significant difference in training times.

Next, we initialize our competitors: XGBRegressor and CatBoostRegressor. For a fair comparison, we use equivalent hyperparameters for both: 100 boosting rounds (n_estimators in XGBoost, iterations in CatBoost), a learning rate of 0.1, and a maximum tree depth of 3.
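One caveat: by default, both libraries use all available CPU cores, and they may not scale equally well across cores on your machine. For a stricter apples-to-apples comparison, you can optionally pin both to a single thread, as in this sketch (XGBoost exposes n_jobs and CatBoost exposes thread_count for this):

from xgboost import XGBRegressor
from catboost import CatBoostRegressor

# Pin both libraries to one thread so neither benefits from
# better (or worse) multi-core scaling on this particular machine.
xgb_single = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3,
                          random_state=42, n_jobs=1)
cb_single = CatBoostRegressor(iterations=100, learning_rate=0.1, max_depth=3,
                              random_state=42, verbose=False, thread_count=1)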

We then fit each regressor on the dataset, measuring the training time with time.perf_counter, a high-resolution clock well suited to benchmarking. The start time is recorded before fitting, and the elapsed time is calculated once fitting completes.
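If you plan to time several models, it can help to factor this pattern into a small helper. The time_fit name below is our own convenience, not part of either library:

import time

def time_fit(model, X, y):
    # Fit the model and return the elapsed wall-clock time in seconds.
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

With it, each measurement reduces to a one-liner such as xgb_time = time_fit(xgb_reg, X, y).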

Finally, we print the training times for both regressors.

Here’s an example output:

XGBRegressor training time: 0.16 seconds
CatBoostRegressor training time: 0.74 seconds

Below is an updated comparison that repeats each experiment ten times and plots the distributions.

import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

# Generate a synthetic regression dataset
X, y = make_regression(n_samples=100000, n_features=10, noise=0.1, random_state=42)

# Initialize the regressors with comparable hyperparameters
xgb_reg = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
cb_reg = CatBoostRegressor(iterations=100, learning_rate=0.1, max_depth=3, random_state=42, verbose=False)

# Lists to store training times
xgb_times = []
cb_times = []

# Run the benchmark 10 times
for i in range(10):
    # Measure training time for XGBRegressor
    start_time = time.perf_counter()
    xgb_reg.fit(X, y)
    xgb_duration = time.perf_counter() - start_time
    xgb_times.append(xgb_duration)

    # Measure training time for CatBoostRegressor
    start_time = time.perf_counter()
    cb_reg.fit(X, y)
    cb_duration = time.perf_counter() - start_time
    cb_times.append(cb_duration)

    # Report progress
    print(f'> {i} xgb: {xgb_duration:.3f}, cb: {cb_duration:.3f}')

# Calculate mean and standard deviation of training times
xgb_mean = np.mean(xgb_times)
xgb_std = np.std(xgb_times)
cb_mean = np.mean(cb_times)
cb_std = np.std(cb_times)

# Print mean and standard deviation of training times
print(f"XGBRegressor mean training time: {xgb_mean:.2f} seconds (std: {xgb_std:.2f})")
print(f"CatBoostRegressor mean training time: {gb_mean:.2f} seconds (std: {gb_std:.2f})")

# Plot the distributions as side-by-side boxplots using matplotlib
plt.figure(figsize=(10, 6))
plt.boxplot([xgb_times, cb_times], labels=['XGBRegressor', 'CatBoostRegressor'])
plt.ylabel('Training Time (seconds)')
plt.title('Training Time Comparison')
plt.show()

The results may look something like the following:

> 0 xgb: 0.142, cb: 0.713
> 1 xgb: 0.145, cb: 0.668
> 2 xgb: 0.149, cb: 0.669
> 3 xgb: 0.183, cb: 0.666
> 4 xgb: 0.152, cb: 0.671
> 5 xgb: 0.190, cb: 0.674
> 6 xgb: 0.146, cb: 0.698
> 7 xgb: 0.147, cb: 0.670
> 8 xgb: 0.165, cb: 0.792
> 9 xgb: 0.174, cb: 0.691
XGBRegressor mean training time: 0.16 seconds (std: 0.02)
CatBoostRegressor mean training time: 0.69 seconds (std: 0.04)

Exact times will vary based on your hardware, but in this case, XGBRegressor trains roughly four times as fast as CatBoostRegressor (0.16 vs. 0.69 seconds on average).
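With only ten runs per model, it is worth a quick sanity check that the gap exceeds run-to-run noise. One option, assuming SciPy is installed (an extra dependency not used above), is a one-sided Mann-Whitney U test on the collected timing lists:

from scipy.stats import mannwhitneyu

# xgb_times and cb_times are the lists collected by the benchmark above.
stat, p_value = mannwhitneyu(xgb_times, cb_times, alternative="less")
print(f"Mann-Whitney U p-value: {p_value:.4f}")

A small p-value suggests XGBRegressor's times are systematically lower, not just luckier in this batch of runs.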

This speed advantage can be significant when working with large datasets or when you need to iterate quickly. By leveraging XGBoost’s speed, you can experiment with more features and hyperparameters, ultimately building better models faster.
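If training time still dominates your iteration loop, XGBoost offers further levers. As one example, recent releases accept tree_method="hist" (histogram-based split finding, the default since XGBoost 2.0) and, on a CUDA-capable machine, device="cuda"; the sketch below assumes a reasonably recent XGBoost version:

from xgboost import XGBRegressor

# Histogram-based split finding; uncomment device to train on a GPU.
fast_xgb = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3,
                        random_state=42, tree_method="hist",
                        # device="cuda",
                        )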

Of course, training speed isn’t the only consideration. CatBoost has its own strengths, like excellent handling of categorical features. The best choice depends on your specific dataset and requirements.
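For example, CatBoost can consume raw string categories directly via the cat_features argument to fit, with no manual encoding step; the toy DataFrame below is our own illustration:

import pandas as pd
from catboost import CatBoostRegressor

# One raw string column; CatBoost encodes it internally.
df = pd.DataFrame({
    "city": ["london", "paris", "london", "tokyo"],
    "size": [2.0, 3.5, 1.0, 4.2],
})
target = [200.0, 310.0, 150.0, 400.0]

model = CatBoostRegressor(iterations=10, verbose=False, random_state=42)
model.fit(df, target, cat_features=["city"])
print(model.predict(df))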


