TabPFN on Apple Silicon

A recent paper published in Nature caught my eye: Accurate predictions on small data with a tabular foundation model by Hollmann et al.

Here we present the Tabular Prior-data Fitted Network (TabPFN), a tabular foundation model that outperforms all previous methods on datasets with up to 10,000 samples by a wide margin, using substantially less training time. 

This foundation model was trained on around 130,000,000 synthetically generated datasets that mimic “real-world” tabular data. These synthetic datasets varied in size and number of features, covered both classification and regression tasks, and had Gaussian noise added to mimic real-world complexity.

They report that TabPFN excels on small- to medium-sized datasets with up to 10,000 samples and 500 features, which is ideal for many projects. Indeed, whilst there is a huge amount of interest in very, very large global models, in many cases a smaller local model performs as well or better DOI.

There is a very nice exploration of TabPFN in a cheminformatics setting, TabPFN for chemical datasets, which uses the Therapeutics Data Commons (TDC) and RDKit descriptors. They also provide the Python script used. This was used with minor modifications to allow selection of GPU or CPU, and selection of particular datasets. Some datasets failed on the first pass and the input data needed to be cleaned.

To install

This creates a folder called tabpfn-tdc that contains all the data and the submission Python script, which can be invoked with:

The Python script
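As a rough sketch, the GPU/CPU selection modification can be done along these lines (the `choose_device` helper and its flag values are my own illustration, not the actual code from the script):

```python
def choose_device(requested: str = "auto") -> str:
    """Return the device string to pass to the TabPFN estimators.

    'mps' is the Metal backend on Apple Silicon; we fall back to 'cpu'
    when PyTorch or MPS support is unavailable.
    """
    if requested in ("cpu", "mps"):
        return requested  # honour an explicit user choice
    try:
        import torch  # TabPFN is built on PyTorch
        if torch.backends.mps.is_available():
            return "mps"
    except ImportError:
        pass
    return "cpu"

# The chosen device string is then passed to the estimators, e.g.
# TabPFNClassifier(device=choose_device()) for the classification tasks.
```

TabPFN's scikit-learn-style estimators accept a `device` argument, so switching between CPU and Apple GPU is just a matter of which string gets passed in.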

Results

The results are shown in the tables below; the TabPFN performance matches that shown previously.

The first time I ran it a number of the datasets failed, and on closer inspection a number of the records needed to be removed or edited. For example:

“Butanal, reaction products with aniline”,CCCC=O.Nc1ccccc1,-4.5021013295

dialuminium(3+) ion dimolybdenum nonaoxidandiide,[Al+3].[Al+3].[Mo].[Mo].[O-2].[O-2].[O-2].[O-2].[O-2].[O-2].[O-2].[O-2].[O-2],-4.2291529922
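A minimal sketch of the kind of cleaning step involved (dropping multi-component SMILES is my assumption about a reasonable rule here; the column layout matches the records above, but the original clean-up may have differed):

```python
import csv
import io

def clean_records(csv_text: str) -> list:
    """Keep only rows whose SMILES string is a single connected component.

    Multi-component SMILES (containing '.'), such as reaction mixtures or
    salts, were among the records that caused failures.
    """
    rows = []
    # csv.reader copes with quoted names like
    # "Butanal, reaction products with aniline" that contain commas
    for name, smiles, value in csv.reader(io.StringIO(csv_text)):
        if "." in smiles:
            continue  # drop mixtures/salts, e.g. CCCC=O.Nc1ccccc1
        rows.append([name, smiles, value])
    return rows

data = '''"Butanal, reaction products with aniline",CCCC=O.Nc1ccccc1,-4.5021013295
phenol,Oc1ccccc1,-0.04
'''
print(clean_records(data))  # only the single-component phenol row survives
```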

The performance using GPU and CPU was identical, and in many cases matched or exceeded the methods described in the TDC leaderboard.

M2 Mac Studio Ultra

| Dataset | Size | Task | Metric | TabPFN performance | Current TDC best performance | TabPFN TDC leaderboard rank | Ave time GPU (mins) | Ave time CPU (mins) |
|---|---|---|---|---|---|---|---|---|
| Caco2_Wang | 906 | Regression | MAE | 0.282 ± 0.005 | 0.276 ± 0.005 | 2nd | 2.1 | 1.3 |
| HIA_Hou | 578 | Classification | AUROC | 0.987 ± 0.001 | 0.990 ± 0.002 | 5th | 1.2 | 0.5 |
| Pgp_Broccatelli | 1218 | Classification | AUROC | 0.937 ± 0.004 | 0.938 ± 0.002 | 2nd | 2.9 | 2.2 |
| Bioavailability_Ma | 640 | Classification | AUROC | 0.732 ± 0.016 | 0.753 ± 0.000 | 5th | 1.4 | 0.73 |
| Bbb_Martins | 2030 | Classification | AUROC | 0.918 ± 0.003 | 0.920 ± 0.006 | 2nd | 12.7 | 5.75 |
| Vdss_Lombardo | 1130 | Regression | Spearman | 0.693 ± 0.004 | 0.713 ± 0.007 | 3rd | 2.9 | 2.2 |
| Cyp2D6_Substrate_Carbonmangels | 667 | Classification | AUPRC | 0.717 ± 0.009 | 0.736 | 6th | 4 | 0.8 |
| Cyp3A4_Substrate_Carbonmangels | 670 | Classification | AUROC | 0.641 ± 0.004 | 0.667 ± 0.019 | 7th | 4.1 | 0.9 |
| Cyp2C9_Substrate_Carbonmangels | 669 | Classification | AUPRC | 0.400 ± 0.013 | 0.441 ± 0.033 | 10th | 4.1 | 0.85 |
| Half_Life_Obach | 667 | Regression | Spearman | 0.546 ± 0.013 | 0.576 ± 0.025 | 6th | 4.1 | 0.8 |
| Clearance_Microsome_Az | 1102 | Regression | Spearman | 0.632 ± 0.006 | 0.630 ± 0.010 | 1st | 6.9 | 1.9 |
| Clearance_Hepatocyte_Az | 1213 | Regression | Spearman | 0.396 ± 0.004 | 0.536 ± 0.02 | >10th | 7.6 | 2.4 |
| Herg | 655 | Classification | AUROC | 0.850 ± 0.002 | 0.880 ± 0.002 | 6th | 4 | 0.8 |
| Dili | 475 | Classification | AUROC | 0.910 ± 0.005 | 0.925 ± 0.005 | 6th | 3 | 0.5 |

Larger Datasets

| Dataset | Size | Task | Metric | TabPFN performance | Current TDC best performance | TabPFN TDC leaderboard rank | Ave time GPU (mins) | Ave time CPU (mins) |
|---|---|---|---|---|---|---|---|---|
| lipophilicity_astrazeneca | 4200 | Regression | MAE | 0.506 ± 0.005 | 0.46 ± 0.006 | 5th | 24 | 22.9 |
| ppbr_az | 2790 | Regression | MAE | 7.075 ± 0.035 | 7.505 ± 0.073 | 1st | 17.6 | 10.7 |
| AMES | 7255 | Binary | AUROC | 0.845 ± 0.002 | 0.871 ± 0.002 | 8th | 64 | 63 |
| solubility_aqsoldb | 9982 | Regression | MAE | 0.756 ± 0.003 | 0.725 ± 0.011 | 2nd | 117 | 115 |
| LD50_Zhu | 7385 | Regression | MAE | 0.603 ± 0.004 | 0.541 ± 0.015 | 4th | 69 | 69 |

M1 MacBook Pro Max

| Dataset | Size | Task | Metric | TabPFN performance | Current TDC best performance | TabPFN TDC leaderboard rank | Ave time GPU (mins) | Ave time CPU (mins) |
|---|---|---|---|---|---|---|---|---|
| HIA_Hou | 578 | Classification | AUROC | 0.986 ± 0.001 | 0.990 ± 0.002 | 5th | 3.5 | 0.76 |
| Clearance_Microsome_Az | 1213 | Regression | MAE | 0.630 ± 0.006 | 0.630 ± 0.005 | 1st | 6.7 | 2.3 |

Rather unexpectedly, the time taken using the CPU for the initial tests was much less than using the GPU (MPS). I did wonder if I’d got the data switched, but repeating the runs and checking CPU usage confirmed the result, as shown below.

In contrast, when using MPS the GPU usage shot up.

However, for the larger datasets the times became comparable, and at around 7,500 records the time taken is the same. Whilst inference is not as demanding as building the model, it is not clear why the CPU is faster than the GPU for the smaller datasets. As datasets get larger, parallelisation on the GPU might be expected to become advantageous; I’ll investigate this in a subsequent post.
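For anyone wanting to reproduce the timing comparison, a minimal harness along these lines is all that is needed (the `time_run` helper is my own sketch; the hypothetical `run_benchmark` call stands in for whatever runs a single TDC dataset on a given device):

```python
import time

def time_run(fn, *args, repeats=1):
    """Return the average wall-clock time of fn(*args) in minutes."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats / 60.0

# Hypothetical usage: run the same dataset on both devices and compare.
# cpu_mins = time_run(run_benchmark, "Caco2_Wang", "cpu")
# gpu_mins = time_run(run_benchmark, "Caco2_Wang", "mps")
```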

I also had a look at the performance on an M1 MacBook Pro Max; it was a little slower but still useful.