Reinforcement learning has enabled impressive robotic skills, from dexterous manipulation to humanoid locomotion, but still depends on hand-crafted reward functions that are slow to design and error-prone. Eureka reduces this burden by generating reward code from task instructions with an LLM, but it revises rewards only from coarse reward statistics, lacking semantic interpretation of behavior and often inducing reward hacking. We introduce the Reward Design Agent (RDA), a VLM-based agentic framework that decomposes tasks, semantically evaluates behavior, and iteratively revises reward code to better follow instructions. Evaluated with GPT-5 on 12 ManiSkill manipulation tasks and 4 HumanoidBench loco-manipulation tasks, RDA outperforms human-designed rewards and Eureka, and is the only method to solve plug-insertion and humanoid package-delivery. By leveraging semantic evaluation and reflection, RDA produces instruction-aligned behavior while mitigating the reward-hacking shortcuts common in other methods.
We thoroughly evaluate RDA on a diverse set of robotic control tasks.
Environments:
12 tabletop manipulation tasks in ManiSkill and 4 loco-manipulation tasks in HumanoidBench.
Training: RDA uses GPT-5 with medium reasoning effort as the VLM backbone. All policies are trained using SAC with the SimbaV2 architecture. Each run consists of 5 evolutionary iterations with 4 reward candidates and 50M environment steps on ManiSkill (8 reward candidates and 10M environment steps on HumanoidBench).
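As a rough illustration of the search budget described above, the settings can be written down as a small configuration object. This is only a sketch, not the authors' code: the class and field names are hypothetical, and it assumes that the candidate count applies per evolutionary iteration and that HumanoidBench also uses 5 iterations. Whether the environment-step budget is per candidate or per run is not stated, so the sketch stores it without aggregating it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardSearchConfig:
    """Hypothetical container for the per-benchmark search settings."""
    benchmark: str
    iterations: int                # evolutionary iterations per run
    candidates_per_iteration: int  # ASSUMPTION: candidate count is per iteration
    env_steps: int                 # environment-step budget as reported

    def total_candidates(self) -> int:
        # Number of reward functions trained over one full run,
        # under the per-iteration assumption above.
        return self.iterations * self.candidates_per_iteration


# Values taken from the training description; iterations=5 for
# HumanoidBench is an assumption (only candidates and steps are stated).
MANISKILL = RewardSearchConfig("ManiSkill", 5, 4, 50_000_000)
HUMANOIDBENCH = RewardSearchConfig("HumanoidBench", 5, 8, 10_000_000)

if __name__ == "__main__":
    for cfg in (MANISKILL, HUMANOIDBENCH):
        print(cfg.benchmark, cfg.total_candidates(), cfg.env_steps)
```

Under these assumptions, a ManiSkill run trains 20 reward candidates in total and a HumanoidBench run trains 40.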
Evaluation:
We use two complementary metrics to assess each policy:
Alignment Rate:
Measures how well the policy's behavior aligns with the task instruction. We use GPT-4.1
to rate 5 rollout videos per trained policy for alignment with the task instruction on a
5-point Likert scale (normalized to [0, 1]).
Success Rate:
Measures task completion using the binary success indicator provided by each benchmark
(e.g., object within target tolerance).
It does not assess whether the policy's behavior aligns with the specified instruction.
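The two metrics can be sketched in a few lines to make the aggregation concrete. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and the linear mapping of a 1-to-5 rating onto [0, 1] via (score - 1) / 4 is an assumption, since the exact normalization is not specified.

```python
from statistics import mean


def normalize_likert(score: int, low: int = 1, high: int = 5) -> float:
    """Map a Likert rating onto [0, 1] (ASSUMED linear mapping)."""
    if not low <= score <= high:
        raise ValueError(f"score {score} outside [{low}, {high}]")
    return (score - low) / (high - low)


def alignment_rate(likert_scores: list[int]) -> float:
    """Average normalized rating over the rated rollout videos."""
    return mean(normalize_likert(s) for s in likert_scores)


def success_rate(success_flags: list[bool]) -> float:
    """Fraction of rollouts the benchmark's binary indicator marks as solved."""
    if not success_flags:
        raise ValueError("need at least one rollout")
    return sum(bool(f) for f in success_flags) / len(success_flags)


if __name__ == "__main__":
    # Hypothetical ratings for 5 rollout videos of one policy.
    print(alignment_rate([4, 5, 3, 4, 5]))
    print(success_rate([True, True, False, True, True]))
```

A policy can score high on one metric and low on the other: a rollout that reaches the success tolerance through an unintended shortcut counts toward the success rate but would receive a low alignment rating.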
RDA generates highly task-aligned policies.
On ManiSkill, RDA and Eureka consistently
outperform human-designed rewards, achieving strong alignment and success.
On HumanoidBench, RDA matches Eureka in
success and surpasses it in alignment, while human-designed rewards fall short.
Eureka’s lack of visual analysis leads to
mis-specified rewards.
In this demo, we visualize the best RDA reward for each environment and the policy trained with it.
ManiSkill
HumanoidBench
RDA response shown within code block.
@article{lee2026rda,
title = {Reward Design Agent for Reinforcement Learning},
author = {Hojoon Lee and Ajay Subramanian and Ben Abbatematteo and Vijay Veerabadran and Pedro Matias and Karl Ridgeway and Nitin Kamra},
year = {2026},
journal = {Reinforcement Learning Conference (RLC)}
}