Tabular Data – Machine Learning Lab

Research Topics & Interests

Foundation Models for Tabular Data

TODO

Automated Machine Learning

TODO

Automated Data Science

TODO

Ongoing Projects

Large Language Models Engineer Too Many Simple Features for Tabular Data

Tabular machine learning problems often require time-consuming and labor-intensive feature engineering. Recent efforts have focused on using large language models (LLMs) to capitalize on their potential domain knowledge. At the same time, researchers have observed ethically concerning negative biases in other LLM-related use cases, such as text generation. These developments motivated us to investigate whether LLMs exhibit a bias that negatively impacts the performance of feature engineering. While not ethically concerning, such a bias could hinder practitioners from fully utilizing LLMs for automated data science. Therefore, we propose a method that detects potential biases by identifying anomalies in the frequency of operators (e.g., adding two features) that LLMs suggest when engineering new features. Our experiments evaluate the bias of four LLMs, two large frontier models and two small open-source models, across 27 tabular datasets. Our results indicate that LLMs are biased toward simple operators, such as addition, and can fail to utilize more complex operators, such as grouping followed by aggregation. Furthermore, the bias can negatively impact predictive performance when using LLM-generated features. Our results call for mitigating this bias when using LLMs for feature engineering.
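
The idea of spotting anomalies in operator frequencies can be illustrated with a minimal sketch. This is not the method from the paper. It assumes a hypothetical log of operator names suggested by an LLM, a hypothetical operator pool, and a uniform share per operator as the reference baseline. The function names and the `tolerance` threshold are illustrative assumptions as well.

```python
from collections import Counter


def operator_frequencies(suggested_ops, operator_pool):
    """Return the relative frequency of each operator in the pool."""
    counts = Counter(suggested_ops)
    total = sum(counts[op] for op in operator_pool)
    if total == 0:
        raise ValueError("no suggested operator belongs to the pool")
    return {op: counts[op] / total for op in operator_pool}


def flag_anomalies(freqs, tolerance=0.5):
    """Flag operators whose share deviates strongly from a uniform share.

    Assumption: a uniform distribution over the pool is the reference.
    An operator is flagged when its relative deviation from the uniform
    share exceeds `tolerance` (a hypothetical threshold).
    """
    expected = 1.0 / len(freqs)
    flagged = {}
    for op, freq in freqs.items():
        deviation = (freq - expected) / expected
        if abs(deviation) > tolerance:
            flagged[op] = "over-used" if deviation > 0 else "under-used"
    return flagged


# Hypothetical log of operators suggested by an LLM (made-up data).
pool = ["add", "subtract", "multiply", "groupby_mean"]
suggested = ["add"] * 14 + ["subtract"] * 3 + ["multiply"] * 2 + ["groupby_mean"]

freqs = operator_frequencies(suggested, pool)
flags = flag_anomalies(freqs)
print(freqs)
print(flags)
```

In this made-up log, addition takes far more than its uniform share, while the grouping operator is rarely suggested. A more careful analysis would likely replace the uniform baseline with an empirically motivated one.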

Student: Jaris Küken

Open Projects & Thesis Topics

Please refer to https://ml.informatik.uni-freiburg.de/student/ for open topics.

Members

PhD Students

Lennart Purucker

purucker@cs.uni-freiburg.de

Students

Breenda Das

Lyubomir Ivanov

Jaris Küken

Charlotte Lange

Alumni

Magnus Bühler

Katharina Eggensperger

+49 761 203-98603

eggenspk@cs.uni-freiburg.de

Matthias Feurer

feurerm@cs.uni-freiburg.de

Martin Mráz

Publications

2024

Küken, Jaris; Purucker, Lennart; Hutter, Frank

Large Language Models Engineer Too Many Simple Features for Tabular Data (Inproceedings)

In: NeurIPS 2024 Third Table Representation Learning Workshop, 2024 (Oral Presentation).

Arango, Sebastian Pineda; Janowski, Maciej; Purucker, Lennart; Zela, Arber; Hutter, Frank; Grabocka, Josif

Dynamic Post-Hoc Neural Ensemblers (Inproceedings)

In: Preprint, 2024.

Helli, Kai; Schnurr, David; Hollmann, Noah; Müller, Samuel; Hutter, Frank

Drift-Resilient TabPFN: In-Context Learning Distribution Shifts on Tabular Data (Inproceedings)

In: Proceedings of the Third International Conference on Automated Machine Learning (AutoML 2024), Workshop Track, 2024.

Robertson, Jake; Hollmann, Noah; Awad, Noor; Hutter, Frank

FairPFN: Transformers Can do Counterfactual Fairness (Conference)

Proceedings of the Third International Conference on Automated Machine Learning (AutoML 2024), Workshop Track, 2024.

Maier, Jannis; Möller, Felix; Purucker, Lennart

Hardware Aware Ensemble Selection for Balancing Predictive Accuracy and Cost (Inproceedings)

In: Proceedings of the Third International Conference on Automated Machine Learning (AutoML 2024), Workshop Track, 2024.

Salinas, David; Erickson, Nick

TabRepo: A Large Scale Repository of Tabular Model Evaluations and its AutoML Applications (Inproceedings)

In: Proceedings of the Third International Conference on Automated Machine Learning (AutoML 2024), ABCD Track, 2024.

Bergman, Eddie; Purucker, Lennart; Hutter, Frank

Don’t Waste Your Time: Early Stopping Cross-Validation (Inproceedings)

In: Proceedings of the Third International Conference on Automated Machine Learning (AutoML 2024), Methods Track, 2024.

Bergman, Edward; Feurer, Matthias; Bahram, Aron; Balef, Amir Rezaei; Purucker, Lennart; Segel, Sarah; Lindauer, Marius; Hutter, Frank; Eggensperger, Katharina

AMLTK: A Modular AutoML Toolkit in Python (Journal Article)

In: Journal of Open Source Software, vol. 9, no. 100, pp. 6367, 2024.

2023

Hollmann, Noah; Müller, Samuel; Hutter, Frank

Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering (Inproceedings)

In: Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023.

Purucker, Lennart; Beel, Joeran

CMA-ES for Post Hoc Ensembling in AutoML: A Great Success and Salvageable Failure (Conference)

AutoML Conference 2023, 2023.

Purucker, Lennart; Schneider, Lennart; Anastacio, Marie; Beel, Joeran; Bischl, Bernd; Hoos, Holger

Q(D)O-ES: Population-based Quality (Diversity) Optimisation for Post Hoc Ensemble Selection in AutoML (Conference)

AutoML Conference 2023, 2023.

Hollmann, Noah; Müller, Samuel; Eggensperger, Katharina; Hutter, Frank

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second (Inproceedings)

In: The Eleventh International Conference on Learning Representations (ICLR), 2023 (top-25% of accepted papers).

2022

Feurer, Matthias; Eggensperger, Katharina; Falkner, Stefan; Lindauer, Marius; Hutter, Frank

Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning (Journal Article)

In: Journal of Machine Learning Research, vol. 23, no. 261, pp. 1-61, 2022.

Purucker, Lennart; Beel, Joeran

Assembled-OpenML: Creating Efficient Benchmarks for Ensembles in AutoML with OpenML (Conference)

First Conference on Automated Machine Learning (Late-Breaking Workshop), 2022.

2021

Bischl, Bernd; Casalicchio, Giuseppe; Feurer, Matthias; Gijsbers, Pieter; Hutter, Frank; Lang, Michel; Mantovani, Rafael G; van Rijn, Jan N; Vanschoren, Joaquin

OpenML Benchmarking Suites (Inproceedings)

In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.

Kadra, Arlind; Lindauer, Marius; Hutter, Frank; Grabocka, Josif

Well-tuned Simple Nets Excel on Tabular Datasets (Inproceedings)

In: Thirty-Fifth Conference on Neural Information Processing Systems, 2021.

Zimmer, Lucas; Lindauer, Marius; Hutter, Frank

Auto-Pytorch: Multi-Fidelity MetaLearning for Efficient and Robust AutoDL (Journal Article)

In: IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-1, 2021.

Feurer, Matthias; van Rijn, Jan N; Kadra, Arlind; Gijsbers, Pieter; Mallik, Neeratyoy; Ravi, Sahithya; Müller, Andreas; Vanschoren, Joaquin; Hutter, Frank

OpenML-Python: an extensible Python API for OpenML (Journal Article)

In: Journal of Machine Learning Research, vol. 22, no. 100, pp. 1-5, 2021.