
XGBClassifier Faster Than CatBoostClassifier

When it comes to gradient boosting, XGBoost and CatBoost are both strong contenders known for their predictive prowess. But in the race to fit your data the fastest, which one takes the lead?

Let’s pit them against each other and see how they stack up.

Before we begin, make sure you have the catboost library installed. If not, you can easily add it to your Python environment using pip:

pip install catboost

Now, let’s set up a fair fight. We’ll generate a large synthetic multiclass classification dataset and fit both classifiers using comparable hyperparameters:

from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
import time

# Generate a synthetic multiclass classification dataset
X, y = make_classification(n_samples=100000, n_classes=5, n_informative=10, random_state=42)

# Initialize the classifiers with comparable hyperparameters
xgb_clf = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
cat_clf = CatBoostClassifier(iterations=100, learning_rate=0.1, max_depth=3, random_state=42, verbose=False)

# Fit XGBClassifier and measure the training time
start_time = time.perf_counter()
xgb_clf.fit(X, y)
xgb_time = time.perf_counter() - start_time
print(f"XGBClassifier training time: {xgb_time:.2f} seconds")

# Fit CatBoostClassifier and measure the training time
start_time = time.perf_counter()
cat_clf.fit(X, y)
cat_time = time.perf_counter() - start_time
print(f"CatBoostClassifier training time: {cat_time:.2f} seconds")

We use scikit-learn’s make_classification to whip up a dataset with 100,000 samples and 5 classes. This should give our boosters plenty to chew on.

For the contenders, we set up XGBClassifier and CatBoostClassifier with matching hyperparameters: 100 boosting rounds (n_estimators in XGBoost, iterations in CatBoost), a learning rate of 0.1, a maximum tree depth of 3, and the same random seed.

We time each classifier’s fit method using the time module, starting the clock before calling fit and stopping it once training completes. The elapsed time is then printed.

The moment of truth! Here’s an example of what you might see:

XGBClassifier training time: 1.30 seconds
CatBoostClassifier training time: 2.23 seconds
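
If you want the gap as a single number, you can append a one-line follow-up to the script above (it reuses the xgb_time and cat_time variables measured there):

# Express the gap as a speedup ratio
print(f"XGBClassifier was {cat_time / xgb_time:.1f}x faster than CatBoostClassifier")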

Below is an updated comparison that repeats the experiment ten times and plots the distribution of training times. Note that this run uses a binary rather than a multiclass dataset.

import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=100000, n_classes=2, random_state=42)

# Initialize the classifiers with comparable hyperparameters
xgb_clf = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
cat_clf = CatBoostClassifier(iterations=100, learning_rate=0.1, max_depth=3, random_state=42, verbose=False)

# Lists to store training times
xgb_times = []
cat_times = []

# Run the benchmark 10 times
for i in range(10):
    # Measure training time for XGBClassifier
    start_time = time.perf_counter()
    xgb_clf.fit(X, y)
    xgb_duration = time.perf_counter() - start_time
    xgb_times.append(xgb_duration)

    # Measure training time for CatBoostClassifier
    start_time = time.perf_counter()
    cat_clf.fit(X, y)
    cat_duration = time.perf_counter() - start_time
    cat_times.append(cat_duration)

    # Report progress
    print(f'> {i} xgb: {xgb_duration:.3f}, cat: {cat_duration:.3f}')

# Calculate mean and standard deviation of training times
xgb_mean = np.mean(xgb_times)
xgb_std = np.std(xgb_times)
cat_mean = np.mean(cat_times)
cat_std = np.std(cat_times)

# Print mean and standard deviation of training times
print(f"XGBoostClassifier mean training time: {xgb_mean:.2f} seconds (std: {xgb_std:.2f})")
print(f"CatBoostClassifier mean training time: {gb_mean:.2f} seconds (std: {gb_std:.2f})")

# Plot the distributions as side-by-side boxplots using matplotlib
plt.figure(figsize=(10, 6))
plt.boxplot([xgb_times, cat_times], labels=['XGBClassifier', 'CatBoostClassifier'])
plt.ylabel('Training Time (seconds)')
plt.title('Training Time Comparison')
plt.show()

The results may look something like the following:

> 0 xgb: 0.244, cat: 1.186
> 1 xgb: 0.257, cat: 1.118
> 2 xgb: 0.265, cat: 1.127
> 3 xgb: 0.262, cat: 1.132
> 4 xgb: 0.298, cat: 1.147
> 5 xgb: 0.249, cat: 1.123
> 6 xgb: 0.281, cat: 1.129
> 7 xgb: 0.300, cat: 1.129
> 8 xgb: 0.247, cat: 1.117
> 9 xgb: 0.291, cat: 1.129
XGBClassifier mean training time: 0.27 seconds (std: 0.02)
CatBoostClassifier mean training time: 1.13 seconds (std: 0.02)

Keep in mind that the exact times depend on your machine's specs, and, since both libraries train with multiple threads by default, on how many cores they get. But in most cases, you can expect XGBoost to have the edge in speed.
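
For a stricter apples-to-apples run, you can pin both libraries to the same thread count. Here's a minimal sketch using XGBClassifier's n_jobs and CatBoostClassifier's thread_count parameters; the thread count itself is an arbitrary choice for illustration:

from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Pin both models to the same number of threads for a fairer comparison
n_threads = 4  # arbitrary; set to suit your machine

xgb_clf = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                        n_jobs=n_threads, random_state=42)
cat_clf = CatBoostClassifier(iterations=100, learning_rate=0.1, max_depth=3,
                             thread_count=n_threads, random_state=42, verbose=False)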

This advantage can be a game-changer when you’re dealing with massive datasets or need to iterate rapidly. By leveraging XGBoost’s swift fitting, you can test more hypotheses and fine-tune your models faster.
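
For instance, a hyperparameter sweep becomes far more affordable when each fit is quick. Here's a minimal sketch using scikit-learn's GridSearchCV; the grid itself is illustrative, not a tuned recommendation:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Same synthetic multiclass dataset as above
X, y = make_classification(n_samples=100000, n_classes=5, n_informative=10, random_state=42)

# A small illustrative grid: 2 x 3 = 6 candidates, 3 folds each
param_grid = {
    'max_depth': [3, 5],
    'learning_rate': [0.05, 0.1, 0.2],
}

search = GridSearchCV(
    XGBClassifier(n_estimators=100, random_state=42),
    param_grid,
    cv=3,
    n_jobs=1,  # leave the parallelism to XGBoost itself
)
search.fit(X, y)
print(search.best_params_)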


