Last September roboticist Benjie Holson of Robust AI posted the “Humanoid Olympic Games” on Substack: a set of increasingly difficult tests for humanoid robots, each of which he demonstrated himself while dressed in a silver bodysuit. The challenges, such as opening a door with a round doorknob, started out easy, at least for a human, and progressed to “gold medal” tasks such as buttoning and hanging up a men’s dress shirt and using a key to open a door.
Holson’s point was that the hard tasks aren’t the dazzling ones. Whereas other competitions feature robots playing sports and dancing, Holson argued that the robots we actually want are the ones that can do laundry and cook meals.
He expected years to go by before competing robots could overcome the challenges. Instead, within months a robot created by San Francisco–based company Physical Intelligence completed 11 of the 15 tasks—earning medals from bronze to gold—including washing windows, spreading peanut butter and using a dog-poop bag.
Scientific American spoke to Holson about why vision-only, or camera-based, systems are outperforming his expectations and how close we are to a genuinely useful machine. He has since released a new, more difficult set of challenges.
An edited transcript of the interview follows.
You designed these tests to be hard. Were you surprised by how quickly the results came in?
It was so much faster than I was expecting. When I chose the challenges, I was trying to calibrate them so some bronze ones would get done in the first month or two, then silver and gold in the next six months, and the most difficult ones might take a year or a year and a half. To have them do basically almost all of them in the first three months is wild.

Dressed as a robot, Benjie Holson demonstrates the silver-medal challenge from his proposed Humanoid Olympics, in which a robot must cook and plate a sunny-side-up egg. (Image credit: Benjie Holson)
What made that possible?
I started with the premise that we have things that look impressive at a fairly narrow set of tasks—vision-only, no touch, simple manipulator, not incredible precision. That limits what you can be good at. I tried to think of tasks that would require us to break forward out of that set. It turns out I wildly underestimated what’s possible with vision-only robots and simple manipulators.
When I visited Physical Intelligence, I learned that its robots don’t have any force sensing. They’re doing all of that in a 100 percent vision-based way. I thought that the key-insertion task or spreading the peanut butter would require force inputs. But apparently you just throw more video demonstrations at it, and it works.
How exactly do you train a robot to do that without coding it line by line?
It’s all learning from demonstration. Somebody teleoperates the robot doing the task hundreds of times, they train a model based on that, and then the robot can do the task.
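As a rough illustration of the idea Holson describes, learning from demonstration can be sketched as recording (observation, action) pairs from an operator and fitting a policy to them. The toy below is not Physical Intelligence's system, which trains large models on camera images. Everything in it is an assumption made for illustration: the 1-D gripper, the scripted "operator" that stands in for a human teleoperator, the linear policy, and all function names.

```python
# Toy behavior cloning: fit a policy to (observation, action) pairs
# recorded from a hypothetical, scripted "teleoperation" session.

def record_demonstrations(num_episodes=200, target=0.25):
    """Stand-in for teleoperation: an operator drives a 1-D gripper toward a target."""
    pairs = []
    for episode in range(num_episodes):
        # Start positions are spread evenly over [-1, 1].
        position = -1.0 + 2.0 * episode / (num_episodes - 1)
        for _ in range(10):
            observation = target - position  # what the robot "sees": offset to the target
            action = 0.5 * observation       # operator's command: close half the gap
            pairs.append((observation, action))
            position += action
    return pairs

def fit_linear_policy(pairs):
    """Closed-form least-squares fit of action = gain * observation + bias."""
    n = len(pairs)
    mean_obs = sum(o for o, _ in pairs) / n
    mean_act = sum(a for _, a in pairs) / n
    cov = sum((o - mean_obs) * (a - mean_act) for o, a in pairs)
    var = sum((o - mean_obs) ** 2 for o, _ in pairs)
    gain = cov / var
    bias = mean_act - gain * mean_obs
    return gain, bias

def run_policy(gain, bias, start, target=0.25, steps=20):
    """Roll out the learned policy with no operator in the loop."""
    position = start
    for _ in range(steps):
        observation = target - position
        position += gain * observation + bias
    return position

demos = record_demonstrations()
gain, bias = fit_linear_policy(demos)
final = run_policy(gain, bias, start=0.9)
print(f"learned gain={gain:.3f}, bias={bias:.3f}, final position={final:.3f}")
```

The learned policy only imitates what the demonstrations covered. That is one reason, in this simplified picture, that more varied demonstrations are needed before a robot copes with new tables, lighting or objects.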
There is a lot of confusion about whether large language models (LLMs) are useless for robots. Are they?
I used to think the utility of LLMs in robotics was fairly dubious. The problem they were good at solving two or three years ago was high-level planning—“If I want to make tea, what are the steps?” Ordering the steps is the easy part. Picking up the teapot and filling it is the really challenging thing.
On the other hand, we’ve started doing vision-action models using the same transformer architecture as that used in LLMs. You can use transformers for text in, text out and for images in, text out—but also for images in, robot actions out.
The neat thing is they’re starting with models pretrained on text, images and maybe video. Before you even start training your specific task, the AI model already understands what a teapot is, what water is, that you might want to fill a teapot with water. So while training for your task, it doesn’t have to start from, “Let me figure out what geometry is.” It can start with, “I see, we’re moving teapots around”—and it is wild that it works.
How did you come up with the “Olympic” tasks?
So part of it was a challenge, and part of it was a prediction. I tried to think of the next set of things that we can’t do now that someone’s going to be able to do soon.
Humans rely on touch to do things such as locating keys in a pocket. How do we get around that in robotics?
That’s a very good question that we don’t know the answer to yet. Touch technology is way worse, more expensive, delicate and far behind cameras. We’ve been working on cameras for a long time.
The big question is: Are cameras enough? Both Physical Intelligence and Sunday Robotics [whose entry completed the bronze-medal task of rolling matched socks] have made the bet that putting a camera on the wrist, very close to the fingers, lets you kind of see forces by seeing how everything smushes. When the robot grabs something, it sees the fingers have some rubber that deflects; the object deflects, and it infers forces from that. When smearing peanut butter on bread, the robot watches the knife deflect down and crush the bread and judges forces from that. It works way better than I expected.
What about safety?
The amount of energy needed to stay balanced is often quite high. If a robot is falling, that’s a very fast, hard acceleration to get the leg in front in time. Your system has to inject a lot of energy into the world—and that’s what’s unsafe.
I’m a huge fan of centaur robots, which have a mobile wheelbase with arms and a head. For safety, that’s a much easier way to get there quickly. If a humanoid loses power, it’s going to fall down. The general plan seems to be to make a robot so incredibly valuable that we as a society create a new safety class for it, as we did for bicycles and cars. They’re dangerous but so valuable that we tolerate the risk.
Have these results changed your timeline?
I used to think home robots were at least 15 years away. Now I think at least six. The difference is that I thought it would be much longer before doing a useful thing in a human space, even as a demo, would be plausible.
But roboticists have seen time and again that there’s a long road between “it worked in a lab, and I got a video” and “I can sell a product.” Waymo was driving on roads in 2009; I couldn’t buy a ride until 2024. It takes a long time to get reliability squared away.
What’s the biggest bottleneck left?
Reliability and safety. The stuff Physical Intelligence shows is incredibly impressive, but if you put it on a different table with different lighting and use a different sock, it might not work. Each step toward generalization seems to take an order of magnitude more data, turning days of data collection into weeks or months.

