
XGBoost Is The Best Algorithm for Tabular Data

Tabular data refers to structured data organized into rows and columns, similar to a spreadsheet or database table, where each row represents an individual record and each column represents a feature or variable.
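For instance, a few rows of tabular data represented as a pandas DataFrame (the feature names here are hypothetical, made up for the example):

```python
# A minimal illustration of tabular data: each row is a record,
# each column is a feature (names are invented for this example).
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [48_000, 64_000, 92_000],
    "owns_home": [0, 1, 1],
})
print(df)
```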

XGBoost (or, more broadly, gradient-boosted decision trees) is generally regarded as the most suitable algorithm for tabular data.

This is because, on average, it produces a more skillful model in less time than other algorithms on tabular datasets. This is particularly worth highlighting given the current popularity of neural networks: XGBoost is generally preferred over neural networks on tabular data.

This finding is well supported by large-scale benchmark studies.

This does not mean that the research question of “the best predictive modeling method on the datasets we care about” is settled; more work is required (and the no free lunch theorem is relevant at the limit).

It also does not mean that XGBoost will give the best or fastest model on every dataset, but empirical research suggests that it outperforms other methods on average.
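To make the claim concrete, here is a minimal baseline sketch: fit XGBoost with its default parameters on a synthetic tabular dataset (the dataset, split, and seed are arbitrary illustrative choices):

```python
# Baseline sketch: default-parameter XGBoost on synthetic tabular data.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for a real tabular dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Default parameters are often a strong starting point on tabular data.
model = XGBClassifier(random_state=42)
model.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

The point is not the exact score, but that a competitive baseline takes only a few lines and little to no tuning.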

As supporting evidence, below are some (admittedly cherry-picked) quotes:

How to Win Kaggle Competitions

The first big development that we saw or the first big contribution I think Kaggle made is we made it very clear that Random Forest was the best algorithm for actually most problems at that time, and then, let’s say about 2014, Tianqi Chen at the University of Washington released XGBoost. I think it was always thought that gradient boosting machines should be better, you know. It’s a very smart approach to run someway decision trees. They should be better. They’re very, very finicky before XGBoost. When Tianqi Chen launched XGBoost, it really took over from Random Forest.

Anthony Goldbloom, How to Win Kaggle Competitions, Gradient Dissent, 2020.

Tabular Data: Deep Learning is Not All You Need

Our study shows that XGBoost outperforms these deep models across the datasets, including the datasets used in the papers that proposed the deep models. We also demonstrate that XGBoost requires much less tuning.

Tabular Data: Deep Learning is Not All You Need, 2021.

In conclusion, despite significant progress using deep models for tabular data, they do not outperform XGBoost on the datasets we explored, and further research is probably needed in this area.

Tabular Data: Deep Learning is Not All You Need, 2021.

Why Do Tree-Based Models Still Outperform Deep Learning On Tabular Data?

Results show that tree-based models remain state-of-the-art on medium-sized data (∼10K samples) even without accounting for their superior speed.

Why Do Tree-Based Models Still Outperform Deep Learning On Tabular Data?, 2022.

… our systematic benchmark […] reveals clear trends. On such data, tree-based models more easily yield good predictions, with much less computational cost. This superiority is explained by specific features of tabular data: irregular patterns in the target function, uninformative features, and non rotationally-invariant data where linear combinations of features misrepresent the information.

Why Do Tree-Based Models Still Outperform Deep Learning On Tabular Data?, 2022.

Deep Neural Networks and Tabular Data: A Survey

Our results, which we have made publicly available as competitive benchmarks, indicate that algorithms based on gradient-boosted tree ensembles still mostly outperform deep learning models on supervised learning tasks, suggesting that the research progress on competitive deep learning models for tabular data is stagnating.

Deep Neural Networks and Tabular Data: A Survey, 2022.

Deep neural network-based methods for heterogeneous tabular data are still inferior to machine learning methods based on decision tree ensembles for small and medium-sized data sets (less than ~1M samples).

Deep Neural Networks and Tabular Data: A Survey, 2022.

Forecasting with Trees

The prevalence of approaches based on gradient boosted trees among the top contestants in the M5 competition is potentially the most eye-catching result. Tree-based methods out-shone other solutions, in particular deep learning-based solutions. The winners in both tracks of the M5 competition heavily relied on them.

Forecasting with Trees, 2022.

The M5 competition has re-affirmed that tree-based methods and gradient-boosted trees belong to the toolbox of forecasters working on big data, operational forecasting problems. We believe that the sophistication of the existing implementations ensures a strong performance. These include feature processing, appropriate loss functions, execution speed, and robust default parametrization.

Forecasting with Trees, 2022.
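As a hedged sketch of what forecasting with trees can look like in practice (not the M5 winners' pipelines), the example below turns a synthetic univariate series into a tabular problem with lag features and fits an XGBRegressor; the series, lag count, and split point are illustrative assumptions:

```python
# Sketch: recast a time series as tabular data via lag features,
# then forecast with a gradient-boosted tree model.
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

rng = np.random.default_rng(42)
series = pd.Series(np.sin(np.linspace(0, 20, 300)) + rng.normal(0, 0.1, 300))

# Each column is a lagged copy of the series; the target is the current value.
n_lags = 5
frame = pd.DataFrame({f"lag_{i}": series.shift(i) for i in range(1, n_lags + 1)})
frame["y"] = series
frame = frame.dropna()

# Chronological split: never shuffle time series data.
train, test = frame.iloc[:250], frame.iloc[250:]

model = XGBRegressor(random_state=42)
model.fit(train.drop(columns="y"), train["y"])
preds = model.predict(test.drop(columns="y"))

print("MAE:", np.mean(np.abs(preds - test["y"].values)))
```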

When Do Neural Nets Outperform Boosted Trees on Tabular Data?

… for a surprisingly high number of datasets, either the performance difference between GBDTs and NNs is negligible, or light hyperparameter tuning on a GBDT is more important than choosing between NNs and GBDTs.

When Do Neural Nets Outperform Boosted Trees on Tabular Data?, 2023.

We found that the ‘NN vs. GBDT’ debate is overemphasized: for a surprisingly high number of datasets, either a simple baseline method performs on par with all other methods, or light hyperparameter tuning on a GBDT increases performance more than choosing the best algorithm. On the other hand, on average, GBDTs do outperform NNs.

When Do Neural Nets Outperform Boosted Trees on Tabular Data?, 2023.
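A minimal sketch of what “light hyperparameter tuning on a GBDT” might look like: a small randomized search over a handful of XGBoost parameters (the search space and budget are illustrative assumptions, not the paper's protocol):

```python
# Sketch: "light" tuning of XGBoost via a small randomized search.
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# A handful of commonly tuned parameters; ranges are illustrative.
param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 9),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    XGBClassifier(random_state=42),
    param_distributions,
    n_iter=20,  # a small budget: "light" tuning, not an exhaustive sweep
    cv=3,
    random_state=42,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV score:", round(search.best_score_, 3))
```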


