RDA: Reward Design Agent for Reinforcement Learning

¹Meta; ²KAIST; ³NYU
Corresponding authors: joonleesky@kaist.ac.kr, nitinkamra@meta.com

Abstract

Reinforcement learning has enabled impressive robotic skills, from dexterous manipulation to humanoid locomotion, but still depends on hand-crafted reward functions that are slow to design and error-prone. Eureka reduces this burden by generating reward code from task instructions with an LLM, but it revises rewards only from coarse reward statistics, lacking semantic interpretation of behavior and often inducing reward hacking. We introduce the Reward Design Agent (RDA), a VLM-based agentic framework that decomposes tasks, semantically evaluates behavior, and iteratively revises reward code to better follow instructions. Evaluated with GPT-5 on 12 ManiSkill manipulation tasks and 4 HumanoidBench loco-manipulation tasks, RDA outperforms human-designed rewards and Eureka, and is the only method to solve plug-insertion and humanoid package-delivery. By leveraging semantic evaluation and reflection, RDA produces instruction-aligned behavior while mitigating the reward-hacking shortcuts common in other methods.

RDA Overview

Initialization: Given a natural-language instruction and a simulator environment, RDA first decomposes the instruction into a list of subtasks. Conditioned on the instruction, environment, and subtasks, it then generates an initial pool of reward candidates to seed the search process.
Evolutionary Search: During the evolutionary loop, each reward function is used to train a policy in simulation. Multiple trajectories are sampled from each policy and visually evaluated by a Vision-Language Model (VLM), which produces analysis reports containing subtask-wise scores and evaluation rationales. These reports are aggregated across trajectories into a summary, which is then used to revise both the subtask list and the reward code. This process repeats for a fixed number of iterations.
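
The two stages above can be sketched as a single loop. This is a minimal illustration, not the authors' implementation: every callable (`decompose`, `propose`, `train_and_rollout`, `vlm_evaluate`, `revise`) is a hypothetical stand-in for an LLM/VLM call or an RL training run, and revising the next pool from the best-scoring candidate is an assumption, since the text only says that the aggregated summary drives the revision.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List, Optional


@dataclass
class Candidate:
    reward_code: str                 # source text of one reward function
    subtasks: List[str]              # subtask list the reward was written against
    summary: Dict[str, float] = field(default_factory=dict)  # mean VLM score per subtask

    @property
    def score(self) -> float:
        # Overall score: mean of the subtask-wise scores (an assumption).
        return mean(self.summary.values()) if self.summary else 0.0


def run_search(
    instruction: str,
    decompose: Callable[[str], List[str]],
    propose: Callable[[str, List[str], int], List[str]],
    train_and_rollout: Callable[[str, int], List[str]],
    vlm_evaluate: Callable[[str, List[str]], Dict[str, float]],
    revise: Callable[[Candidate, int], List[Candidate]],
    iterations: int = 5,
    pool_size: int = 4,
    rollouts: int = 5,
) -> Optional[Candidate]:
    # Initialization: decompose the instruction, then seed the candidate pool.
    subtasks = decompose(instruction)
    pool = [Candidate(code, list(subtasks))
            for code in propose(instruction, subtasks, pool_size)]
    best: Optional[Candidate] = None

    for it in range(iterations):
        for cand in pool:
            # Train a policy on this reward and sample rollout videos from it.
            videos = train_and_rollout(cand.reward_code, rollouts)
            # One VLM report per trajectory: subtask name -> score.
            reports = [vlm_evaluate(video, cand.subtasks) for video in videos]
            # Aggregate the reports across trajectories into a summary.
            cand.summary = {s: mean(r[s] for r in reports) for s in cand.subtasks}
            if best is None or cand.score > best.score:
                best = cand
        # Assumption: the next pool is revised from the best candidate so far.
        # `revise` may change both the subtask list and the reward code.
        if it < iterations - 1 and best is not None:
            pool = revise(best, pool_size)

    return best
```

Passing the stages in as callables keeps the sketch runnable with simple stubs; in a real system each one would wrap a model call or a simulator training job.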

RDA Evolution Process

Example on Humanoid Package. In this example, the humanoid learns to push the package with its torso rather than both hands. RDA’s visual analysis diagnoses this failure mode (weak bilateral hand contact) and refines the subtask specification and reward function to enforce pushing with bilateral contact.
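
To make the idea of "enforcing bilateral contact" concrete, here is a small hypothetical reward term. It is not the reward RDA generated: the function name, the use of hand-to-package distances, and the `scale` value are all assumptions chosen for illustration. Taking the minimum of the two per-hand terms means the term stays low unless both hands are near the package, so a torso push with the hands held away earns little.

```python
import math


def bilateral_contact_term(left_hand_dist: float,
                           right_hand_dist: float,
                           scale: float = 10.0) -> float:
    """Hypothetical shaping term in (0, 1]; high only when BOTH hands are close.

    Distances are assumed to be non-negative hand-to-package distances in meters.
    """
    left = math.exp(-scale * max(left_hand_dist, 0.0))
    right = math.exp(-scale * max(right_hand_dist, 0.0))
    # min (rather than mean) lets a single distant hand pull the term down,
    # so one-handed or torso-only pushing is not rewarded.
    return min(left, right)
```

A mean of the two terms would still pay out roughly half the bonus for one-handed contact, which is why the sketch uses the minimum.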

Experiments

We evaluate RDA on 16 robotic control tasks spanning tabletop manipulation and humanoid loco-manipulation.


Environments: 12 tabletop manipulation tasks in ManiSkill and 4 loco-manipulation tasks in HumanoidBench.


Training: RDA uses GPT-5 with medium reasoning effort as the VLM backbone. All policies are trained using SAC with the SimbaV2 architecture. On ManiSkill, each run consists of 5 evolutionary iterations with 4 reward candidates and 50M environment steps; on HumanoidBench, it uses 8 reward candidates and 10M environment steps.
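
The settings above can be collected into a small configuration table. This is only a sketch: the class and field names are hypothetical, the text does not say whether the environment-step budget applies per policy or per run, and it is assumed here that HumanoidBench also uses 5 iterations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchConfig:
    benchmark: str
    iterations: int         # evolutionary iterations per run
    reward_candidates: int  # reward candidates, as stated in the text
    env_steps: int          # environment-step budget, as stated in the text


CONFIGS = {
    "ManiSkill": SearchConfig("ManiSkill", iterations=5,
                              reward_candidates=4, env_steps=50_000_000),
    # Assumption: HumanoidBench keeps the same 5 iterations.
    "HumanoidBench": SearchConfig("HumanoidBench", iterations=5,
                                  reward_candidates=8, env_steps=10_000_000),
}
```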


Evaluation: We assess each policy with two complementary metrics.
Alignment Rate: Measures how well the policy's behavior aligns with the task instruction. GPT-4.1 rates 5 rollout videos per trained policy for instruction alignment on a 5-point Likert scale, normalized to [0, 1].
Success Rate: Measures task completion using the binary success indicator provided by each benchmark (e.g., object within target tolerance). It does not assess whether the policy's behavior aligns with the specified instruction.
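
As a rough illustration of how the two metrics could be computed from per-rollout results, consider the sketch below. The linear mapping from Likert scores to [0, 1] (1 maps to 0, 5 maps to 1) and the averaging over rollouts are assumptions; the text states only that scores are normalized to [0, 1]. Function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def alignment_rate(likert_scores: Sequence[int]) -> float:
    """Mean of 5-point Likert ratings, linearly rescaled so 1 -> 0.0 and 5 -> 1.0."""
    if not likert_scores:
        raise ValueError("need at least one rating")
    for s in likert_scores:
        if not 1 <= s <= 5:
            raise ValueError(f"Likert score out of range: {s}")
    return mean((s - 1) / 4 for s in likert_scores)


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of rollouts flagged successful by the benchmark's binary indicator."""
    if not successes:
        raise ValueError("need at least one rollout")
    return sum(successes) / len(successes)
```

The two numbers can diverge: a policy that pushes the package with its torso may score a high success rate while receiving low alignment ratings.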

Evaluation Results

RDA generates highly task-aligned policies. On ManiSkill, RDA and Eureka consistently outperform human-designed rewards, achieving strong alignment and success. On HumanoidBench, RDA matches Eureka in success and surpasses it in alignment, while human-designed rewards fall short. Eureka's lack of visual analysis leads to mis-specified rewards.

RDA Rewards and Policies

In this demo, we show the best RDA reward for each environment and the policy trained with it.


ManiSkill

LiftPegUpright, best RDA reward: rewards/rda/LiftPegUpright-v1-1.txt
PegInsertionSide, best RDA reward: rewards/rda/PegInsertionSide-v1-1.txt
PickCube, best RDA reward: rewards/rda/PickCube-v1-1.txt
PlaceSphere, best RDA reward: rewards/rda/PlaceSphere-v1-1.txt
PlugCharger, best RDA reward: rewards/rda/PlugCharger-v1-1.txt
PokeCube, best RDA reward: rewards/rda/PokeCube-v1-1.txt
PullCube, best RDA reward: rewards/rda/PullCube-v1-1.txt
PullCubeTool, best RDA reward: rewards/rda/PullCubeTool-v1-1.txt
PushCube, best RDA reward: rewards/rda/PushCube-v1-1.txt
PushT, best RDA reward: rewards/rda/PushT-v1-1.txt
RollBall, best RDA reward: rewards/rda/RollBall-v1-1.txt
StackCube, best RDA reward: rewards/rda/StackCube-v1-1.txt

HumanoidBench

h1hand-basketball, best RDA reward: rewards/rda/h1hand-basketball-v0-1.txt
h1hand-door, best RDA reward: rewards/rda/h1hand-door-v0-1.txt
h1hand-package, best RDA reward: rewards/rda/h1hand-package-v0-1.txt
h1hand-push, best RDA reward: rewards/rda/h1hand-push-v0-1.txt

BibTeX

@article{lee2026rda,
    title   = {Reward Design Agent for Reinforcement Learning},
    author  = {Hojoon Lee and Ajay Subramanian and Ben Abbatematteo and Vijay Veerabadran and Pedro Matias and Karl Ridgeway and Nitin Kamra},
    year    = {2026},
    journal = {Reinforcement Learning Conference (RLC)}
}