Reinforcement Learning

Research Topics & Interests

Deep Reinforcement Learning

We tackle key challenges in deep reinforcement learning, aiming to make RL systems both robust and efficient. Our research focuses on enabling agents to learn optimal timing for actions, develop a contextual understanding of complex environments, and build predictive world models that improve decision-making.

Automated Reinforcement Learning

We design methods to automate the development and optimization of reinforcement learning systems, making RL more accessible and efficient. Our research emphasizes creating benchmark environments and pioneering dynamic as well as gray-box optimization approaches. These efforts have produced novel benchmarks and optimization techniques that are widely recognized in the AutoML community for advancing the ease of use and performance of RL systems.

Dynamic Algorithm Configuration

We established new foundations in the field of Dynamic Algorithm Configuration (DAC), moving beyond static algorithm configuration to enable real-time parameter adaptation. Our work introduced frameworks for algorithms to adjust their hyperparameters while running, leading to significant efficiency improvements. This research has practical applications across various domains, from evolutionary computation to classical optimization problems and reinforcement learning.

Deep reinforcement learning (RL) algorithms are a powerful class of methods for optimizing sequential decision-making and control problems, and they drive many real-world applications. Yet they are also very sensitive to their hyperparameters, such as the discount factor or the learning rate, which creates considerable uncertainty about how to use them optimally. Setting these hyperparameters well is notoriously difficult, yet crucial for state-of-the-art performance, and in practice it requires multiple costly training runs to find a well-performing agent. The problem is exacerbated in application scenarios that only allow small amounts of data to be collected, as in many biomedical applications. For example, when optimizing drug dosage schedules in cell cultures, the elapsed time between data points with meaningful variability is typically so large that only a few data points can realistically be collected within a given time frame. Consequently, we aim to develop automated approaches that set the hyperparameters of deep RL methods sample-efficiently from small datasets, thus reducing this uncertainty.

Student: Asif Hasan
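
As a rough illustration of this setting, the sketch below runs a tiny random search over the learning rate and discount factor under a very small trial budget. `evaluate_agent` is a hypothetical stand-in for a full (and expensive) RL training run, and in practice a more sample-efficient optimizer (e.g. Bayesian optimization) would replace the random proposals.

```python
import random

def evaluate_agent(learning_rate, discount_factor):
    """Hypothetical stand-in: train an RL agent with the given hyperparameters
    and return its mean evaluation return. Replace with a real training run."""
    # Simulated noisy response surface, purely for illustration.
    score = -abs(learning_rate - 3e-4) * 1e3 - abs(discount_factor - 0.99) * 10
    return score + random.gauss(0, 0.1)

def small_budget_search(n_trials=10, seed=0):
    """Random search under a tiny trial budget, as one might use when every
    data point (i.e. every training run) is expensive to collect."""
    random.seed(seed)
    best = None
    for _ in range(n_trials):
        config = {
            "learning_rate": 10 ** random.uniform(-5, -2),    # log-uniform
            "discount_factor": 1 - 10 ** random.uniform(-3, -1),
        }
        score = evaluate_agent(**config)
        if best is None or score > best[0]:
            best = (score, config)
    return best

print(small_budget_search())
```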

Traditional methods for Algorithm Configuration (AC) typically predict fixed parameter settings, yet research has shown that dynamically adapting parameters over time can significantly improve performance. This insight has led to the development of many handcrafted heuristics. To better respond to changing optimization dynamics, automatic Dynamic Algorithm Configuration (DAC) has been developed, with most current approaches relying on online Reinforcement Learning (RL). However, online RL is resource-intensive and, due to its need for extensive interactions with the environment, is not feasible for certain domains. Offline RL mitigates these challenges by training on pre-collected datasets, although it introduces new challenges, such as distributional divergence. This work explores the application of offline RL for DAC across various domains. We demonstrate that offline RL can effectively adapt the learning rate for Stochastic Gradient Descent (SGD).

Students: Leon Gieringer, Janis Fix & Asif Hasan
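
A minimal sketch of how such an offline dataset could be collected, assuming a toy quadratic objective and a hand-crafted exponential-decay teacher; the state features and reward used here are illustrative choices, not the exact design of the project.

```python
import numpy as np

def teacher_lr(step, lr0=0.1, decay=0.97):
    """Hand-crafted teacher schedule: exponential decay."""
    return lr0 * decay ** step

def collect_offline_dataset(n_steps=50, dim=10, seed=0):
    """Run SGD on a toy quadratic with the teacher controlling the learning
    rate, logging (state, action, reward, next_state) transitions that an
    offline RL algorithm could later be trained on."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)                    # parameters being optimized
    loss = float(x @ x)
    dataset = []
    for t in range(n_steps):
        state = np.array([t / n_steps, np.log(loss + 1e-8)])
        lr = teacher_lr(t)                      # action chosen by the teacher
        x = x - lr * 2 * x                      # SGD step on f(x) = ||x||^2
        next_loss = float(x @ x)
        reward = loss - next_loss               # illustrative reward: loss decrease
        next_state = np.array([(t + 1) / n_steps, np.log(next_loss + 1e-8)])
        dataset.append((state, lr, reward, next_state))
        loss = next_loss
    return dataset

transitions = collect_offline_dataset()
print(len(transitions), transitions[0])
```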

Contextual reinforcement learning has been shown to be a viable avenue for training generalizable agents by providing meta-information about the environment's dynamics (such as the weight of a robot or the gravity of a planet). However, this setting has so far only been studied extensively in the online and model-based RL settings. Offline RL promises an alternative in which well-performing policies are learned purely from existing datasets. In this work, we study the generalization capabilities of offline RL agents when they are provided with context information about the environment at hand.

Student: Rachana Tirumanyam
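
A minimal PyTorch sketch of the most direct way to expose context to an agent: concatenating a context vector (e.g. gravity and robot mass) to the observation before the policy network. The network sizes and context features are illustrative.

```python
import torch
import torch.nn as nn

class ContextConditionedPolicy(nn.Module):
    """Policy that receives explicit context features alongside the observation."""

    def __init__(self, obs_dim, ctx_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, ctx):
        # Context (e.g. [gravity, robot_mass]) is simply appended to the observation.
        return self.net(torch.cat([obs, ctx], dim=-1))

policy = ContextConditionedPolicy(obs_dim=11, ctx_dim=2, act_dim=3)
obs = torch.randn(4, 11)                 # batch of observations
ctx = torch.tensor([[9.81, 1.0]] * 4)    # gravity and mass as context
print(policy(obs, ctx).shape)            # -> torch.Size([4, 3])
```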

This project explores quantum-inspired representations for decision-making policies, challenging the conventional approach of modeling policies as simple probability distributions over actions. While traditional methods have proven effective in many scenarios, they often fail to capture the complex interdependencies and contextual nuances that arise in sophisticated decision environments. By leveraging concepts from quantum mechanics, we aim to develop a richer representational framework that can express intricate decision-making patterns more naturally.

Student: Sathya Kamesh Bethanabhotla
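
To make the idea concrete, the sketch below shows one possible quantum-inspired parameterization: actions are represented by complex amplitudes and action probabilities are obtained via the Born rule (squared modulus). This is an illustrative construction, not the specific formulation developed in the project.

```python
import numpy as np

def born_rule_policy(amplitudes):
    """Turn a complex amplitude vector into an action distribution.

    Relative phases between amplitudes carry structure that a plain softmax
    over real logits cannot express on its own.
    """
    probs = np.abs(amplitudes) ** 2
    return probs / probs.sum()

# Hypothetical amplitudes produced by some state-dependent function.
amps = np.array([0.6 + 0.2j, -0.3 + 0.5j, 0.1 - 0.4j])
print(born_rule_policy(amps))   # a valid probability distribution over 3 actions
```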

This project investigates geometric formulations of contextual reinforcement learning, moving beyond the limitations of classical approaches. While conventional contextual RL acknowledges context variables, it typically embeds them into standard learning pipelines without exploiting their structural properties, leading to sample-inefficient learning that heavily relies on deep neural networks to discover relationships through extensive training. Our geometric perspective explicitly models the intrinsic structure of context spaces to directly capture the underlying manifold of related environments. By formalizing these relationships mathematically rather than leaving them for neural networks to implicitly approximate, we aim to develop theoretically grounded algorithms that require significantly fewer samples and generalize more reliably.

Student: Premraj Thakur

This thesis investigates the impact of dynamic contextual information on reinforcement learning performance, where context is modeled as a continuous variable that evolves throughout the learning process. Unlike traditional approaches that treat context as discrete or static, we examine how continuously shifting environmental and task parameters affect agent training dynamics and convergence properties. The project focuses on developing and evaluating novel training protocols that can effectively handle temporal context variations, with particular attention to sample efficiency, policy robustness, and generalization capabilities across different rates of contextual change.

Student: Martin Mráz
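
A minimal sketch of how a continuously drifting context could be injected into training, assuming a Gymnasium CartPole whose gravity we treat as the context variable; the sinusoidal drift schedule and the random-action agent are placeholders.

```python
import math
import gymnasium as gym

env = gym.make("CartPole-v1")
base_gravity, amplitude = 9.8, 2.0

for episode in range(20):
    # Context drifts smoothly across episodes instead of being fixed or resampled discretely.
    env.unwrapped.gravity = base_gravity + amplitude * math.sin(0.3 * episode)
    obs, info = env.reset(seed=episode)
    done = False
    while not done:
        action = env.action_space.sample()   # placeholder for the learning agent
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
```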

In this project, we explore the fundamental relationship between context and multi-agent reinforcement learning systems, investigating how contextual information manifests and influences behavior in environments with multiple learning agents. This work examines the multi-faceted nature of context in such settings, where contextual factors may emerge from agent interactions, environmental conditions, or the collective system state itself. We aim to develop a deeper understanding of what constitutes meaningful context in multi-agent scenarios and how different contextual representations affect coordination, emergence, and learning dynamics.

Student: Jonas Kölblin

Unfortunately, we cannot offer any RL-specific projects or theses at the moment. Please refer to https://ml.informatik.uni-freiburg.de/student/ for other open topics.

Model-based RL has shown tremendous sample efficiency and applicability in a broad variety of problem domains. However, the typical benchmarks on which these models are evaluated ignore a crucial aspect of real-world scenarios: partial observability. This work aims at understanding the impact of partial observability on model-based agents and sheds light on how world models of partially observable environments encode the underlying environment state.

Student: Sai Prasanna
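
One simple way to induce the kind of partial observability studied here is to mask out part of the observation. The Gymnasium wrapper below hides CartPole's velocity components, so the underlying state must be inferred from history; this is an illustrative construction, not the MordorHike benchmark itself.

```python
import numpy as np
import gymnasium as gym

class MaskVelocities(gym.ObservationWrapper):
    """Turn CartPole into a POMDP by zeroing out the velocity entries,
    so a single observation no longer identifies the underlying state."""

    def observation(self, obs):
        obs = np.array(obs, copy=True)
        obs[1] = 0.0   # cart velocity
        obs[3] = 0.0   # pole angular velocity
        return obs

env = MaskVelocities(gym.make("CartPole-v1"))
obs, info = env.reset(seed=0)
print(obs)   # positions remain, velocities are hidden
```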

This study investigates how different decoding strategies affect the performance of large language models when used as automated evaluators. Comparing top-k sampling, beam search, and chain-of-thought approaches using LLaMA 3.1-8B-Instruct, the research reveals that all three strategies achieve similar accuracy levels, but top-k sampling provides the best computational efficiency. More significantly, the study demonstrates that prompt design has a greater impact on judge performance than the choice of decoding algorithm.

Student: Shaza Kawoosa
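
For illustration, the snippet below contrasts two of the compared decoding strategies using the Hugging Face `transformers` API; `meta-llama/Llama-3.1-8B-Instruct` is gated and large, so any smaller instruct model can be substituted, and the judging prompt is a simplified example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # substitute a smaller model for a quick test
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Rate the following answer from 1 to 5 and justify briefly:\nQ: What is 2+2?\nA: 4"
inputs = tokenizer(prompt, return_tensors="pt")

# Top-k sampling: stochastic and cheap, one forward pass per generated token.
sampled = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50)

# Beam search: deterministic but roughly `num_beams` times more expensive.
beamed = model.generate(**inputs, max_new_tokens=64, do_sample=False, num_beams=4)

print(tokenizer.decode(sampled[0], skip_special_tokens=True))
print(tokenizer.decode(beamed[0], skip_special_tokens=True))
```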

Deep reinforcement learning models tend to overfit to early experiences. There are many potential root causes; in particular, deep learning tends to greedily follow the learning signal. However, in settings such as reinforcement learning or continual learning, this signal changes over time, requiring more plasticity of the learning process than classical supervised settings. This project studies a novel regularization approach for preserving plasticity in the model and explores its impact across various learning settings.

Student: Philipp Bordne
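
As a concrete reference point (not necessarily the novel approach studied in this project), the sketch below shows one established plasticity-preserving regularizer: an L2 penalty that pulls weights back toward their initialization, keeping the network closer to its trainable starting region as the learning signal shifts.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
init_params = [p.detach().clone() for p in model.parameters()]
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
reg_strength = 1e-2

def regularized_loss(pred, target):
    task_loss = nn.functional.mse_loss(pred, target)
    # Penalty pulling weights back toward their initialization, one known way
    # to counteract the loss of plasticity under a shifting learning signal.
    drift = sum(((p - p0) ** 2).sum() for p, p0 in zip(model.parameters(), init_params))
    return task_loss + reg_strength * drift

x, y = torch.randn(32, 8), torch.randn(32, 2)
loss = regularized_loss(model(x), y)
loss.backward()
optimizer.step()
print(float(loss))
```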

This research addresses the challenge of high-dimensional action spaces in dynamic algorithm configuration by introducing the concept of Coupled Action Dimensions with Importance Differences (CANDID). The work develops a new benchmark within the DACBench suite and proposes sequential policies that factorize the action space to manage exponential growth while maintaining coordination between interdependent dimensions. Experimental results demonstrate that sequential policies significantly outperform both independent factorized policies and single policies across all dimensions, offering an effective solution for scalable dynamic algorithm configuration in complex action spaces.

Student: Philipp Bordne
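
A minimal sketch of the sequential factorization idea: instead of a single head over the exponentially large joint action space, each dimension is predicted in turn, conditioned on the state and the dimensions chosen so far. The sizes, ordering, and sampling scheme here are illustrative, not the exact CANDID policies.

```python
import torch
import torch.nn as nn

class SequentialFactorizedPolicy(nn.Module):
    """Predicts a D-dimensional discrete action one dimension at a time.

    The joint space has n_choices ** n_dims actions, but each head only
    outputs n_choices logits, while conditioning on earlier dimensions keeps
    the dimensions coordinated (unlike fully independent heads).
    """

    def __init__(self, state_dim, n_dims, n_choices, hidden=64):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(state_dim + d, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_choices))
            for d in range(n_dims)
        )

    def forward(self, state):
        chosen = []
        for head in self.heads:
            prev = torch.stack(chosen, dim=-1).float() if chosen else state.new_zeros(state.shape[0], 0)
            logits = head(torch.cat([state, prev], dim=-1))
            chosen.append(torch.distributions.Categorical(logits=logits).sample())
        return torch.stack(chosen, dim=-1)    # shape: (batch, n_dims)

policy = SequentialFactorizedPolicy(state_dim=6, n_dims=3, n_choices=5)
print(policy(torch.randn(2, 6)))
```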

This thesis explores using offline reinforcement learning to dynamically adapt learning rates for Stochastic Gradient Descent, moving beyond static hand-crafted schedules that cannot respond to optimization dynamics. The work integrates multiple teacher schedules into the offline RL framework to enable automatic hyperparameter adaptation during training. Results show that learned schedules exhibit novel, high-performing behavior across diverse problems, with multi-teacher learning providing substantial improvements over single-teacher approaches. Notably, even simple teacher schedules can serve as effective foundations for developing sophisticated learning rate adaptation strategies including warm-starting techniques.

Student: Janis Fix
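
For illustration, the teacher schedules below show the kind of hand-crafted behaviors that can be pooled into one multi-teacher offline dataset; the specific schedules and constants are placeholders.

```python
import math

def constant(step, total, lr0=0.1):
    return lr0

def exponential_decay(step, total, lr0=0.1, decay=0.95):
    return lr0 * decay ** step

def cosine_annealing(step, total, lr0=0.1, lr_min=1e-4):
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * step / total))

teachers = {"constant": constant, "exponential": exponential_decay, "cosine": cosine_annealing}

# Each teacher produces its own trajectory of (step, lr) actions; pooling these
# trajectories yields a more diverse offline dataset than any single teacher.
total_steps = 100
pooled = [(name, t, fn(t, total_steps)) for name, fn in teachers.items() for t in range(total_steps)]
print(len(pooled), pooled[:3])
```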

Dynamic Algorithm Configuration adapts algorithm parameters during execution to improve performance, but current online reinforcement learning approaches are resource-intensive and impractical for many domains. This research demonstrates that offline RL offers a viable alternative, successfully adapting learning rates for Stochastic Gradient Descent and step-sizes for CMA-ES using pre-collected datasets. The study compares different offline RL approaches and teacher-generated datasets, revealing that problem domains require varying balances between replicating teacher behavior and exploring deviations. Implicitly constrained methods prove more conservative and reliable for neural network optimization, while uncertainty-based algorithms show better generalization across multiple functions. Notably, sophisticated parameter schedules can be learned even from relatively simple teacher policies.

Student: Leon Gieringer

This research demonstrates that smaller language models (Llama 2 7B and Llama 3 8B) can achieve near-human performance on Multi-state Bar Examination questions through supervised fine-tuning with a limited dataset of 1,514 questions. The study evaluates domain-specific fine-tuning across seven legal domains, using IRAC-structured reasoning and comparing various configurations including prompt types, answer ordering, and response formats. Results show that targeted fine-tuning enables some model configurations to match human baseline performance despite constrained computational resources.

Student: Rean Clive Fernandes
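
A small sketch of how an MBE-style item could be turned into an IRAC-structured training example for supervised fine-tuning; the field names, wording, and legal content are hypothetical, not the dataset's actual schema.

```python
def format_irac_example(question, options, issue, rule, application, conclusion):
    """Build a (prompt, target) pair whose target follows the IRAC structure
    (Issue, Rule, Application, Conclusion) used to scaffold legal reasoning."""
    prompt = (
        "Answer the following Multi-state Bar Examination question.\n"
        f"Question: {question}\n"
        + "\n".join(f"({letter}) {text}" for letter, text in options.items())
        + "\nRespond using IRAC reasoning before stating the final answer."
    )
    target = (
        f"Issue: {issue}\n"
        f"Rule: {rule}\n"
        f"Application: {application}\n"
        f"Conclusion: {conclusion}"
    )
    return prompt, target

prompt, target = format_irac_example(
    question="Is the landowner liable for the trespasser's injury?",
    options={"A": "Yes, strict liability applies.", "B": "No, no duty was owed."},
    issue="Whether a duty of care is owed to an undiscovered trespasser.",
    rule="Landowners generally owe no duty to undiscovered trespassers.",
    application="The trespasser was undiscovered, so no duty arose.",
    conclusion="The answer is (B).",
)
print(prompt, "\n---\n", target)
```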

Differential Evolution requires careful hyperparameter tuning that can be challenging for inexperienced users. This work proposes using offline Twin Delayed Deep Deterministic Policy Gradients (TD3) for Dynamic Algorithm Configuration, automatically adjusting DE's mutation rate during execution. Applied to benchmark optimization problems, the offline TD3 approach learns superior policies that improve upon established heuristics in most cases, while demonstrating better performance and stability compared to online TD3. The method offers an effective solution for automated hyperparameter adaptation in evolutionary algorithms.

Student: Florian Diederichs
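
A compact sketch of the control interface: a Differential Evolution loop in which the mutation factor F for each generation is set by a policy. The loss-based state features and the stub policy stand in for the trained offline TD3 agent, and the sphere function is a placeholder objective.

```python
import numpy as np

def sphere(x):
    return float(np.sum(x ** 2))

def policy_mutation_rate(state):
    """Stand-in for a trained offline TD3 agent mapping state features to F."""
    best_fitness, generation_frac = state
    return float(np.clip(0.9 - 0.5 * generation_frac, 0.1, 1.0))

def de_with_dac(dim=10, pop_size=20, generations=50, cr=0.9, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5, 5, size=(pop_size, dim))
    fitness = np.array([sphere(ind) for ind in pop])
    for g in range(generations):
        F = policy_mutation_rate((fitness.min(), g / generations))    # DAC action
        for i in range(pop_size):
            r1, r2, r3 = rng.choice([j for j in range(pop_size) if j != i], 3, replace=False)
            mutant = pop[r1] + F * (pop[r2] - pop[r3])                # DE/rand/1 mutation
            cross = rng.random(dim) < cr
            cross[rng.integers(dim)] = True                           # binomial crossover
            trial = np.where(cross, mutant, pop[i])
            f_trial = sphere(trial)
            if f_trial <= fitness[i]:                                 # greedy selection
                pop[i], fitness[i] = trial, f_trial
    return fitness.min()

print(de_with_dac())
```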

This research explores fine-grained Dynamic Algorithm Configuration for Differential Evolution by adapting scaling factors at the individual level rather than population level. The approach uses reinforcement learning to predict scaling factors directly instead of sampling from distributions, while incorporating gradient information to guide optimization. Results demonstrate that trained agents learn effectively from teacher policies and occasionally outperform them, with direct scaling factor prediction and gradient information integration leading to improved solution quality compared to traditional population-level adaptation methods.

Student: Leon Gieringer & Janis Fix

This work addresses the slow initialization problem in Population-Based Bandits (PB2), a method for dynamically optimizing reinforcement learning hyperparameters using time-varying Gaussian processes. Four novel meta-learning approaches are proposed that leverage meta-data from various environments to accelerate early performance, with MultiTaskPB2 applying meta-learning directly to the surrogate model.

Student: Johannes Hog

Members

Postdoctoral Research Fellows

PhD Students

Students

Sathya Kamesh Bethanabhotla

Philipp Bordne

Jonas Kölblin

Martin Mráz

Premraj Thakur

Rachana Tirumanyam

Alumni

Tidiane Camaret Ndir

Rean Clive Fernandes

Florian Diederichs

Janis Fix

Leon Gieringer

Shaza Kawoosa

Sai Prasanna

Zainab Sultan

Jan Ole von Hartz

Publications

2025

Prasanna, Sai; Biedenkapp, André; Rajan, Raghu

One Does Not Simply Estimate State: Comparing Model-based and Model-free Reinforcement Learning on the Partially Observable MordorHike Benchmark Inproceedings Forthcoming

In: Eighteenth European Workshop on Reinforcement Learning, Forthcoming.

Mohan, Aditya; Eimer, Theresa; Benjamins, Carolin; Lindauer, Marius; Biedenkapp, André

Mighty: A Comprehensive Tool for studying Generalization, Meta-RL and AutoRL Inproceedings Forthcoming

In: Eighteenth European Workshop on Reinforcement Learning, Forthcoming.

Nguyen, Tài; Le, Phong; Biedenkapp, André; Doerr, Carola; Dang, Nguyen

On the Importance of Reward Design in Reinforcement Learning-based Dynamic Algorithm Configuration: A Case Study on OneMax with (1+(λ,λ))-GA Inproceedings

In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO'25), 2025, (Won the best paper award in the L4EC track).

Shala, Gresa; Biedenkapp, André; Krack, Pierre; Walter, Florian; Grabocka, Josif

Efficient Cross-Episode Meta-RL Inproceedings

In: Proceedings of the Thirteenth International Conference on Learning Representations (ICLR'25), 2025.

Fernandes, Rean; Biedenkapp, André; Hutter, Frank; Awad, Noor

A Llama walks into the 'Bar': Efficient Supervised Fine-Tuning for Legal Reasoning in the Multi-state Bar Exam Journal Article

In: arXiv:2504.04945 [cs.LG], 2025.

Hog, Johannes; Rajan, Raghu; Biedenkapp, André; Awad, Noor; Hutter, Frank; Nguyen, Vu

Meta-learning Population-based Methods for Reinforcement Learning Journal Article

In: Transactions on Machine Learning Research, 2025, ISSN: 2835-8856.

2024

Ferreira, Fabio; Schlageter, Moreno; Rajan, Raghu; Biedenkapp, André; Hutter, Frank

One-shot World Models Using a Transformer Trained on a Synthetic Prior Inproceedings

In: NeurIPS 2024 Workshop on Open-World Agents, 2024.

Ndir, Tidiane Camaret; Biedenkapp, André; Awad, Noor

Inferring Behavior-Specific Context Improves Zero-Shot Generalization in Reinforcement Learning Inproceedings

In: Seventeenth European Workshop on Reinforcement Learning, 2024.

Bordne, Philipp; Hasan, M. Asif; Bergman, Eddie; Awad, Noor; Biedenkapp, André

CANDID DAC: Leveraging Coupled Action Dimensions with Importance Differences in DAC Inproceedings

In: Proceedings of the Third International Conference on Automated Machine Learning (AutoML 2024), Workshop Track, 2024.

Shala, Gresa; Arango, Sebastian Pineda; Biedenkapp, André; Hutter, Frank; Grabocka, Josif

HPO-RL-Bench: A Zero-Cost Benchmark for HPO in Reinforcement Learning Inproceedings

In: Proceedings of the Third International Conference on Automated Machine Learning (AutoML 2024), ABCD Track, 2024, (Runner up for the best paper award).

Prasanna, Sai; Farid, Karim; Rajan, Raghu; Biedenkapp, André

Dreaming of Many Worlds: Learning Contextual World Models Aids Zero-Shot Generalization Journal Article

In: Reinforcement Learning Journal, vol. 3, iss. 1, no. 1, pp. 1317–1350, 2024, ISBN: 979-8-218-41163-3.

Shala, Gresa; Biedenkapp, André; Grabocka, Josif

Hierarchical Transformers are Efficient Meta-Reinforcement Learners Journal Article

In: arXiv:2402.06402 [cs.LG], 2024.

2023

Rajan, Raghu; Diaz, Jessica Lizeth Borja; Guttikonda, Suresh; Ferreira, Fabio; Biedenkapp, André; von Hartz, Jan Ole; Hutter, Frank

MDP Playground: An Analysis and Debug Testbed for Reinforcement Learning Journal Article

In: Journal of Artificial Intelligence Research (JAIR), vol. 77, pp. 821-890, 2023.

Benjamins, Carolin; Eimer, Theresa; Schubert, Frederik; Mohan, Aditya; Döhler, Sebastian; Biedenkapp, André; Rosenhan, Bodo; Hutter, Frank; Lindauer, Marius

Contextualize Me - The Case for Context in Reinforcement Learning Journal Article

In: Transactions on Machine Learning Research, 2023, ISSN: 2835-8856.

Shala, Gresa; Biedenkapp, André; Hutter, Frank; Grabocka, Josif

Gray-Box Gaussian Processes for Automated Reinforcement Learning Inproceedings

In: Eleventh International Conference on Learning Representations (ICLR'23), 2023.

2022

Shala, Gresa; Arango, Sebastian Pineda; Biedenkapp, André; Hutter, Frank; Grabocka, Josif

AutoRL-Bench 1.0 Inproceedings

In: Workshop on Meta-Learning (MetaLearn@NeurIPS'22), 2022.

Shala, Gresa; Biedenkapp, André; Hutter, Frank; Grabocka, Josif

Gray-Box Gaussian Processes for Automated Reinforcement Learning Inproceedings

In: Workshop on Meta-Learning (MetaLearn@NeurIPS'22), 2022.

Biedenkapp, André

Dynamic Algorithm Configuration by Reinforcement Learning PhD Thesis

University of Freiburg, Department of Computer Science, 2022.

Sass, René; Bergman, Eddie; Biedenkapp, André; Hutter, Frank; Lindauer, Marius

DeepCAVE: An Interactive Analysis Tool for Automated Machine Learning Inproceedings

In: Workshop on Adaptive Experimental Design and Active Learning in the Real World (ReALML@ICML'22), 2022.

Biedenkapp, André; Dang, Nguyen; Krejca, Martin S.; Hutter, Frank; Doerr, Carola

Theory-inspired Parameter Control Benchmarks for Dynamic Algorithm Configuration Inproceedings

In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO'22), 2022, (Won the best paper award in the GECH track).

Biedenkapp, André; Speck, David; Sievers, Silvan; Hutter, Frank; Lindauer, Marius; Seipp, Jendrik

Learning Domain-Independent Policies for Open List Selection Inproceedings

In: Workshop on Bridging the Gap Between AI Planning and Reinforcement Learning (PRL @ ICAPS'22), 2022.

Parker-Holder, Jack; Rajan, Raghu; Song, Xingyou; Biedenkapp, André; Miao, Yingjie; Eimer, Theresa; Zhang, Baohe; Nguyen, Vu; Calandra, Roberto; Faust, Aleksandra; Hutter, Frank; Lindauer, Marius

Automated Reinforcement Learning (AutoRL): A Survey and Open Problems Journal Article

In: Journal of Artificial Intelligence Research (JAIR), vol. 74, pp. 517-568, 2022.

Benjamins, Carolin; Eimer, Theresa; Schubert, Frederik; Mohan, Aditya; Biedenkapp, André; Rosenhan, Bodo; Hutter, Frank; Lindauer, Marius

Contextualize Me – The Case for Context in Reinforcement Learning Journal Article

In: arXiv:2202.04500, 2022.