The Allen Institute for AI (Ai2) on Tuesday unveiled a new artificial intelligence (AI) model that lets robots reason about their surroundings in 3-D before taking action.

The model, called MolmoAct 7B, takes a different approach to robotic decision-making by bringing structured AI reasoning into the physical world, the company said. Trained entirely on open data, it uses step-by-step visual reasoning that makes it easy to preview what a robot plans to do and to steer its behavior intuitively in real time as conditions change.

MolmoAct 7B, the first in its model family, was trained on a curated dataset of about 12,000 “robot episodes” from real-world environments, such as kitchens and bedrooms.

Rather than reasoning through language and converting that into movement, MolmoAct views its surroundings, understands the relationships among space, movement, and time, and plans its movements accordingly.

It does so by generating visual reasoning tokens that transform 2-D image inputs into 3-D spatial plans, allowing robots to navigate the physical world with greater intelligence and control.
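To make that pipeline concrete, here is a minimal Python sketch of the general idea: a model that first emits inspectable spatial-reasoning tokens from a 2-D image, decodes them into a 3-D plan that a person can preview, and only then produces motor commands. The class and method names (ActionReasoningModel, perceive, plan, act) are hypothetical placeholders, not Ai2's actual MolmoAct API.

```python
# Illustrative sketch only: hypothetical names, not Ai2's MolmoAct interface.
# It shows the general shape of a pipeline that exposes its intermediate
# spatial reasoning before committing to motion.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SpatialPlan:
    waypoints: List[Tuple[float, float, float]]  # hypothetical 3-D waypoints decoded from tokens
    tokens: List[int]                             # the raw visual reasoning tokens, kept for inspection

class ActionReasoningModel:  # hypothetical stand-in for an ARM-style model
    def perceive(self, rgb_image) -> List[int]:
        """Encode a 2-D image into visual reasoning tokens (placeholder)."""
        raise NotImplementedError

    def plan(self, tokens: List[int], instruction: str) -> SpatialPlan:
        """Decode the tokens plus a natural-language instruction into a 3-D plan."""
        raise NotImplementedError

    def act(self, plan: SpatialPlan) -> List[dict]:
        """Turn the spatial plan into low-level motor commands."""
        raise NotImplementedError

def run_episode(model: ActionReasoningModel, rgb_image, instruction: str):
    tokens = model.perceive(rgb_image)       # 2-D pixels -> reasoning tokens
    plan = model.plan(tokens, instruction)   # reasoning tokens -> 3-D spatial plan
    print("Planned waypoints:", plan.waypoints)  # a human can inspect or steer here
    return model.act(plan)                   # only then commit to motion
```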

While spatial reasoning isn’t new, most modern systems rely on closed, end-to-end architectures trained on massive proprietary datasets. These models are difficult to reproduce, expensive to scale, and often operate as opaque black boxes, according to Ai2.

“Embodied AI needs a new foundation that prioritizes reasoning, transparency, and openness,” Ai2 CEO Ali Farhadi said in a statement. “With MolmoAct, we’re not just releasing a model; we’re laying the groundwork for a new era of AI, bringing the intelligence of powerful AI models into the physical world. It’s a step toward AI that can reason and navigate the world in ways that are more aligned with how humans do — and collaborate with us safely and effectively.”

MolmoAct is the first in a new category of AI model that Ai2 calls an Action Reasoning Model (ARM), which interprets high-level natural language instructions and reasons through a sequence of physical actions to carry them out in the real world.

An ARM interprets high-level instructions and breaks them down into a transparent chain of spatially grounded decisions.
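As a rough illustration of what such a chain might look like as human-readable data, the sketch below uses an invented GroundedDecision structure and made-up coordinates (not Ai2's format) to show a plan an operator could preview and adjust before the robot executes it.

```python
# Hypothetical illustration, not Ai2 code: a "transparent chain of spatially
# grounded decisions" represented as data a person can read and edit.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GroundedDecision:
    description: str                         # human-readable step, e.g. "reach toward the mug"
    target_xyz: Tuple[float, float, float]   # where in 3-D space the step is grounded

def preview(chain: List[GroundedDecision]) -> None:
    """Print the plan so an operator can approve or adjust it before execution."""
    for i, step in enumerate(chain, start=1):
        print(f"{i}. {step.description} -> target {step.target_xyz}")

# Example chain for the instruction "put the mug on the shelf" (values are invented).
chain = [
    GroundedDecision("reach toward the mug", (0.42, -0.10, 0.05)),
    GroundedDecision("grasp the mug handle", (0.42, -0.10, 0.05)),
    GroundedDecision("lift and move above the shelf", (0.30, 0.35, 0.60)),
    GroundedDecision("place the mug and release", (0.30, 0.35, 0.58)),
]
preview(chain)
```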
