I’ve written about TabPFN previously (https://macinchem.org/2025/02/06/looking-at-tabpfn/), and I see that a technical report on TabPFN-3 has just been published:
https://priorlabs.ai/technical-reports/tabpfn-3
TabPFN is a foundation model trained on around 130,000,000 synthetically generated datasets that mimic “real world” tabular data. The synthetic datasets vary in size and number of features, cover both classification and regression tasks, and have Gaussian noise added to mimic real-world complexities. The pretrained model can then be used to make predictions on small- to medium-sized datasets with up to 10,000 samples and 500 features, and is claimed to be superior to other methods.
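To make the idea of sampling synthetic training datasets a little more concrete, here is a minimal toy sketch. It is not the TabPFN prior: the function name `sample_synthetic_dataset`, the size limits, the linear signal and the noise level are all my own assumptions, chosen only to show randomly sized datasets, both task types, and added Gaussian noise.

```python
import random


def sample_synthetic_dataset(rng, task="regression", max_rows=200,
                             max_features=10, noise_std=0.1):
    """Toy generator for one synthetic tabular dataset.

    Hypothetical illustration only, NOT the prior used to train TabPFN.
    """
    # Sample the dataset size and the number of features.
    n_rows = rng.randint(20, max_rows)
    n_features = rng.randint(2, max_features)

    # A random linear relationship stands in for the hidden data-generating process.
    weights = [rng.gauss(0.0, 1.0) for _ in range(n_features)]

    X, y = [], []
    for _ in range(n_rows):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        signal = sum(w * x for w, x in zip(weights, row))
        # Gaussian noise mimics measurement error in real-world data.
        noisy = signal + rng.gauss(0.0, noise_std)
        X.append(row)
        # Regression keeps the noisy value; classification thresholds it at zero.
        y.append(noisy if task == "regression" else int(noisy > 0.0))
    return X, y


rng = random.Random(0)  # fixed seed so the sketch is reproducible
datasets = [
    sample_synthetic_dataset(rng, task=rng.choice(["regression", "classification"]))
    for _ in range(5)
]
```

A foundation model of this kind is trained across a very large number of such datasets, so that at prediction time it can treat a new, unseen table as context rather than being fitted from scratch.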
According to the report, TabPFN-3 sets a new performance standard. On the standard tabular benchmark TabArena, a forward pass of TabPFN-3 outperforms all other models, including tuned and ensembled baselines, by a significant margin, and Pareto-dominates the speed/performance frontier. TabPFN-3 also scales to more diverse datasets: it ranks first on datasets with many classes, and beats 8-hour-tuned gradient-boosted-tree baselines on datasets up to 1M training rows and 200 features.
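For anyone unfamiliar with the term, Pareto dominance on a speed/performance plot means that one model is at least as good as another on both axes and strictly better on at least one. A small sketch of that check follows; the model names and numbers are made up purely for illustration and are not benchmark results.

```python
def dominates(a, b):
    """Return True if point a Pareto-dominates point b.

    Each point is (runtime_seconds, score): lower runtime and higher score are better.
    """
    time_a, score_a = a
    time_b, score_b = b
    no_worse = time_a <= time_b and score_a >= score_b
    strictly_better = time_a < time_b or score_a > score_b
    return no_worse and strictly_better


def pareto_front(models):
    """Names of the models that no other model dominates."""
    return {
        name
        for name, point in models.items()
        if not any(
            dominates(other, point)
            for other_name, other in models.items()
            if other_name != name
        )
    }


# Hypothetical (runtime, score) pairs, for illustration only.
models = {
    "fast_weak": (1.0, 0.70),
    "slow_strong": (100.0, 0.90),
    "slow_weak": (120.0, 0.65),
    "fast_strong": (0.5, 0.95),
}
front = pareto_front(models)
```

In this toy example `fast_strong` is both quicker and more accurate than every other entry, so it is the only model left on the frontier, which is the kind of result the report is claiming.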