MolmoBot: Training AI Robots in Simulation Instead of the Real World

Synthetic simulation data is becoming a key driver of physical AI in enterprise settings, and MolmoBot from Ai2 applies it in a new way: its robot control models learn entirely from simulated experience.

Historically, teaching robots to interact with the real world has meant relying on human-operated demonstrations, a process that is expensive, time-consuming, and hard to scale. Most companies building general-purpose robot systems have treated real-world data collection as the foundation for AI training, and that choice has created a major bottleneck.

Consider the DROID project, which gathered roughly 76,000 teleoperated trajectories across 13 institutions, equivalent to about 350 hours of human labor. Google DeepMind's RT-1 required 130,000 robot episodes collected over 17 months by human operators. This dependence on proprietary, manually collected datasets has driven up research costs and concentrated cutting-edge robotics work among a handful of well-funded industrial labs.

Ali Farhadi, CEO of the Allen Institute for AI (Ai2), frames the mission differently. He wants to build AI systems that accelerate scientific discovery and expand human capability. In his view, robots should become foundational scientific instruments—tools that help researchers move faster and ask better questions. But that only works if the underlying AI systems can generalize to real-world conditions and if the tools are shared openly across the research community. Proving that simulation training transfers to real tasks is a crucial step in that direction.

The Ai2 research team proposed a different economic model with MolmoBot: a suite of robot control models trained entirely on synthetic data. Rather than having humans teleoperate robots to collect data, they automatically generated movement trajectories within a simulation environment called MolmoSpaces.

The accompanying dataset, MolmoBot-Data, contains approximately 1.8 million expert-level manipulation trajectories. The team created it by combining the MuJoCo physics engine with domain randomization: objects, camera angles, lighting, and dynamics are varied at random to produce diverse simulation environments.
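The domain randomization described above can be sketched in a few lines of Python. This is a minimal illustration, not the MolmoSpaces pipeline: the object names, parameter names, and sampling ranges are all hypothetical, and no physics engine is called. It only shows the core idea of drawing a fresh scene configuration for every episode.

```python
import random
from dataclasses import dataclass

# Hypothetical asset names and ranges, chosen only for illustration.
OBJECT_NAMES = ("mug", "bowl", "block", "bottle")
CAMERA_YAW_RANGE_DEG = (-15.0, 15.0)
LIGHT_INTENSITY_RANGE = (0.3, 1.5)
FRICTION_SCALE_RANGE = (0.7, 1.3)


@dataclass(frozen=True)
class SceneParams:
    """One randomized scene configuration for a single simulated episode."""
    object_name: str
    camera_yaw_deg: float
    light_intensity: float
    friction_scale: float


def sample_scene(rng: random.Random) -> SceneParams:
    """Draw each scene property independently from its range."""
    return SceneParams(
        object_name=rng.choice(OBJECT_NAMES),
        camera_yaw_deg=rng.uniform(*CAMERA_YAW_RANGE_DEG),
        light_intensity=rng.uniform(*LIGHT_INTENSITY_RANGE),
        friction_scale=rng.uniform(*FRICTION_SCALE_RANGE),
    )


def sample_scenes(count: int, seed: int = 0) -> list[SceneParams]:
    """Generate a reproducible batch of randomized scenes."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(count)]


scenes = sample_scenes(5, seed=42)
for scene in scenes:
    print(scene)
```

In a real pipeline, each sampled configuration would be applied to the simulator before an episode is rolled out. Seeding the generator keeps a batch reproducible, which helps when debugging a policy against a specific scene.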

Ranjay Krishna, who leads the PRIOR team at Ai2, explains that most current approaches try to narrow the sim-to-real gap by adding more real-world data. The Ai2 team bet in the opposite direction: the gap can be shrunk by dramatically expanding the diversity of simulated environments, objects, and camera conditions. The insight is to shift the industry's focus from manual data collection to designing better virtual worlds, a problem that technology can actually solve.

To generate the simulation data, the team used 100 Nvidia A100 GPUs. The system produces roughly 1,024 episodes per GPU-hour, equivalent to more than 130 hours of robot experience for every hour of real time.
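The figures above allow a rough back-of-the-envelope estimate of generation speed. The sketch below assumes that all 100 GPUs run concurrently at the quoted per-GPU rate and that every generated trajectory is kept; neither assumption is confirmed by the article, so the result is an order-of-magnitude estimate only.

```python
# Figures quoted in the article.
RATE_PER_GPU_HOUR = 1_024          # trajectories generated per GPU-hour
NUM_GPUS = 100                     # Nvidia A100 GPUs
DATASET_TRAJECTORIES = 1_800_000   # approximate size of MolmoBot-Data


def fleet_rate_per_hour(rate_per_gpu_hour: int, num_gpus: int) -> int:
    """Trajectories per wall-clock hour if all GPUs run in parallel (assumption)."""
    return rate_per_gpu_hour * num_gpus


def hours_to_generate(target: int, rate_per_gpu_hour: int, num_gpus: int) -> float:
    """Wall-clock hours to reach `target`, assuming nothing is discarded."""
    return target / fleet_rate_per_hour(rate_per_gpu_hour, num_gpus)


fleet_rate = fleet_rate_per_hour(RATE_PER_GPU_HOUR, NUM_GPUS)
wall_clock_hours = hours_to_generate(DATASET_TRAJECTORIES, RATE_PER_GPU_HOUR, NUM_GPUS)

print(f"Fleet rate: {fleet_rate:,} trajectories per hour")
print(f"Estimated time for the full dataset: {wall_clock_hours:.1f} hours")
```

Under these assumptions the fleet produces 102,400 trajectories per hour, so a dataset of about 1.8 million trajectories corresponds to roughly 17.6 hours of wall-clock time. Any filtering of failed rollouts would lengthen that estimate.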

Compared with real-world data collection, this approach delivers nearly four times the data throughput, which shortens development cycles and improves return on investment for robotics projects.

MolmoBot consists of three distinct control policies and was tested on two hardware platforms: the Rainbow Robotics RB-Y1 mobile robot and the Franka FR3 robotic arm mounted on a table. The primary model uses the Molmo2 vision-language foundation, processing multiple RGB frames alongside natural language instructions to decide robot actions.

For edge computing environments with limited resources, the team also provides MolmoBot-SPOC, a lightweight transformer policy. A third variant, MolmoBot-Pi0, is built on the PaliGemma backbone, as is Physical Intelligence's π0 model, which enables direct performance comparisons.

In real-world tests, these models transferred to physical tasks without additional fine-tuning—even when handling objects or environments that never appeared in the training data.

On a pick-and-place task, MolmoBot achieved a 79.2% success rate. That is roughly double the 39.2% achieved by π0.5, which was trained on large real-world datasets. On mobile manipulation tasks, the robot successfully completed actions such as approaching objects, grasping door handles, and pulling doors fully open.

Offering multiple architectures lets organizations integrate powerful physical AI without being locked into a single proprietary vendor or complex data infrastructure.

The entire MolmoBot ecosystem, including training data, data generation pipelines, and model architectures, is released as open source. This lets organizations validate, customize, and deploy physical AI systems while keeping costs under control.

Farhadi stresses that for AI to genuinely advance science, progress cannot depend on closed datasets or isolated systems. He argues that what is needed is shared infrastructure where researchers worldwide can build, test, and improve together, and that this is the path forward for physical AI in the years ahead.
