Dex AI

The Physical Copilot

Zero demonstrations. Zero fine-tuning. Zero-shot.

This is what a general-purpose robot looks like: you simply tell it what you want it to do, and it figures it out.

Open-World Manipulation

For this task, Dex is asked (verbally) to put the orange ball in the box.

It was never trained on (or programmed to recognize) any of the objects in this scene. The entire environment is completely new.

Yet, given the language prompt "put the orange ball in the box", it correctly identifies and locates the box and the orange ball, despite confounding objects in the scene: an apple of a very similar color, an orange, and another ball. It also chooses the right manipulation skills to execute (pick and place). These capabilities are the result of using models pretrained on internet-scale data.

1X. Autonomous. Open-World.

Multi-Step Interactive Task

Dex can do many things purely from language instructions, including playing games like Tic-Tac-Toe.

The robot was neither programmed nor trained to play. It has never seen these pieces or this board before. It does not have any computer-vision code specific to this task.

The only task-specific bit in the entire system is a paragraph of text (in English) explaining the game, the setup and the available motion primitives. This text is provided as context to the LLM.
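To illustrate how little task-specific material this implies, the following Python sketch assembles such a paragraph from three pieces: the rules, the setup, and the motion primitives. All names here (`MotionPrimitive`, `build_task_context`, the `pick`/`place` signatures) and the sample wording are assumptions made for this example, not Dex's actual prompt or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotionPrimitive:
    """Hypothetical description of one skill the robot can execute."""
    signature: str
    description: str


def build_task_context(rules: str, setup: str, primitives: list) -> str:
    """Join the rules, the setup, and the primitive list into one text block."""
    lines = [
        "Game rules:", rules.strip(), "",
        "Setup:", setup.strip(), "",
        "Available motion primitives:",
    ]
    for primitive in primitives:
        lines.append(f"- {primitive.signature}: {primitive.description}")
    return "\n".join(lines)


# Illustrative values only; the real text used with Dex is not published here.
PRIMITIVES = [
    MotionPrimitive("pick(object)", "grasp the named object"),
    MotionPrimitive("place(location)", "put the held object at the named location"),
]

context = build_task_context(
    rules="Players alternate placing pieces on a 3x3 grid; three in a row wins.",
    setup="The board is on the table in front of the robot; spare pieces sit beside it.",
    primitives=PRIMITIVES,
)
```

In this sketch, `context` is the only part that would change from one task to the next; everything else in the system stays the same.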

This is all done with large pretrained vision and language models (VLMs), without any adaptation or fine-tuning.

Grounded Conversation (Speech-to-Speech)

Here, Dex is asked to describe a new scene, and then to place one candy on each plate.

Even though it is seeing this scene and these types of objects for the first time, it is able to detect them, reason about them, and manipulate them.

The combination of speech-to-speech language and open-world manipulation creates a completely magical experience, enabling the kind of fluid and intuitive interaction only seen in movies.

Grounded Reasoning

This is a sorting task, specified ambiguously: put the parts in the right box.

Dex correctly reasons about the ask, placing the screws in the box with screws and the washers in the box with washers.

This is a long-horizon, multi-step task. Because Dex decomposes it into discrete steps, the length of the task does not affect performance.
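One way to picture this decomposition is a loop that handles a single part per iteration, so each step looks the same whether the table holds three parts or thirty. The Python sketch below is a toy simulation under that assumption; `sort_parts` and its data layout are hypothetical and do not describe Dex's planner.

```python
def sort_parts(table, boxes):
    """Move every part on the table into the box for its type, one part at a time.

    table: list of part types still on the table, e.g. ["screw", "washer"]
    boxes: dict mapping a part type to the list of parts already in that box
    Returns the flat list of (primitive, argument) steps that were issued.
    """
    steps = []
    while table:
        part = table.pop(0)        # look only at the next part in the scene
        steps.append(("pick", part))
        boxes[part].append(part)   # the matching box is chosen by part type
        steps.append(("place", f"{part} box"))
    return steps


# Simulated scene: three loose parts and two boxes.
table = ["screw", "washer", "screw"]
boxes = {"screw": [], "washer": []}
steps = sort_parts(table, boxes)
```

Each iteration depends only on the current state of the table, which is why a longer task simply means more iterations of the same short step.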

The objects, the scene and the task are completely new to the robot. Dex was not trained for this task.

Recovery From Errors

Dex needs to put the LEGO pieces scattered on the table in the box.

The robot fails to grasp the last two LEGO pieces on the first attempt, but detects the error and auto-corrects.
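A common pattern for this kind of recovery is to verify the outcome after each attempt and retry when the check fails. The Python sketch below simulates that pattern; `pick_with_retry`, `FlakyGripper`, and the retry limit are assumptions made for illustration, not a description of how Dex detects or corrects errors.

```python
def pick_with_retry(attempt_grasp, is_holding, max_attempts=3):
    """Try a grasp, check the result, and retry until it succeeds.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        attempt_grasp()
        if is_holding():  # verification step: did the grasp actually work?
            return attempt
    return None


class FlakyGripper:
    """Simulated gripper that misses a fixed number of times before succeeding."""

    def __init__(self, failures):
        self.failures = failures
        self.holding = False

    def grasp(self):
        if self.failures > 0:
            self.failures -= 1
            self.holding = False
        else:
            self.holding = True


# Simulated run: the first grasp misses, the second one succeeds.
gripper = FlakyGripper(failures=1)
attempts = pick_with_retry(gripper.grasp, lambda: gripper.holding)
```

The key design choice in this sketch is that success is decided by an explicit check after the action rather than assumed from the action having been issued.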

New task, new objects, new setting: Dex is seeing all this and hearing this language instruction for the first time.
