Editor's note: The online version of this story was posted on January 7.
When Nintendo’s Wii game console debuted in November 2006, its motion-sensing handheld “Wiimotes” got players off the couch and onto their feet. Now Microsoft hopes to outdo its competitor by eliminating the controller altogether: this past January it revealed details of Project Natal, which will give Xbox 360 users the ability to manipulate on-screen characters via natural body movement. The machine-learning technology will enable players to kick a digital soccer ball or swat a handball simply by mimicking the motion in their living room.
Microsoft, which announced its ambitious Xbox upgrade plan in June 2009, has not set a release date, but many observers expect to see Natal at the end of the year. It will consist of a depth sensor that uses infrared signals to create a digital 3-D model of a player’s body as it moves, a video camera that can pick up fine details such as facial expressions, and a microphone that can identify and locate individual voices.
Programming a game system to discern the almost limitless combinations of joint positions in the human body is a fearsome computational problem. “Every single motion of the body is an input, so you’d need to program near-infinite reactions to actions,” explains Alex Kipman, Microsoft’s director of innovation for Xbox 360.
Instead of trying to preprogram actions, Microsoft decided to teach its gaming technology to recognize gestures in real time, just as a human does: by extrapolating from experience. Jamie Shotton of Microsoft Research Cambridge in the U.K. devised a machine-learning algorithm for the task. The algorithm recognizes poses and renders them in the game space on-screen at 30 frames per second, a rate more than sufficient to convey smooth motion. Essentially, a Natal-enhanced Xbox will capture movement on the fly, without the need for the mirror-studded spandex suit of conventional motion-capture approaches.
Training Natal for the task has required Microsoft to amass a large amount of biometric data. The firm sent observers to homes around the globe, where they videotaped basic motions such as turning a steering wheel or catching a ball, Kipman says. Microsoft researchers later laboriously selected key frames within this footage and marked each joint on each person’s body. Kipman and his team also went into a Hollywood motion-capture studio to gather data on more acrobatic movements.
“During training, we need to provide the algorithm with two things: realistic-looking images that are synthesized and, for each pixel, the corresponding part of the body,” Shotton says. The algorithm processes the data and changes the values of different elements to achieve the best performance.
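The training setup Shotton describes pairs synthetic images with a body-part label for every pixel. The sketch below illustrates that idea in miniature; the body-part names, depth values and nearest-centroid classifier are illustrative assumptions, not Microsoft's actual algorithm, which the article does not detail.

```python
import random

BODY_PARTS = ["head", "torso", "hand"]

def synthesize_frame():
    """Return (pixels, labels): a depth value per pixel plus the
    ground-truth body part for that pixel, as in the synthetic
    training images the article describes."""
    pixels, labels = [], []
    for part in BODY_PARTS:
        # Hypothetical average depths (in meters) for each body part.
        base = {"head": 2.0, "torso": 2.5, "hand": 1.5}[part]
        for _ in range(50):
            pixels.append(base + random.gauss(0, 0.05))
            labels.append(part)
    return pixels, labels

def train(frames):
    """Learn the mean depth per body part: a toy nearest-centroid model."""
    sums, counts = {}, {}
    for pixels, labels in frames:
        for depth, part in zip(pixels, labels):
            sums[part] = sums.get(part, 0.0) + depth
            counts[part] = counts.get(part, 0) + 1
    return {part: sums[part] / counts[part] for part in sums}

def classify(model, depth):
    """Label a pixel with the body part whose centroid is closest."""
    return min(model, key=lambda part: abs(model[part] - depth))
```

Tuning "the values of different elements to achieve the best performance," as Shotton puts it, corresponds here to the learned centroids; a production system would learn far richer per-pixel features.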
To keep the amount of data manageable, the team had to figure out which data were most relevant for training. For example, the system doesn’t need to recognize the entire mass of a person’s body, but only the spacing of his or her skeletal joints. After whittling down the data to the essential motions, the researchers mapped each unique pose to 12 models representing different ages, genders and body types.
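Reducing a full body to the spacing of its skeletal joints can be pictured as computing the distance along each bone of a skeleton graph. This is a minimal sketch of that reduction; the joint names and skeleton edges are hypothetical examples, not the system's real skeleton.

```python
import math

# Illustrative skeleton: each edge connects two named joints.
SKELETON_EDGES = [("shoulder", "elbow"), ("elbow", "wrist")]

def joint_spacing(joints):
    """Map {joint_name: (x, y, z)} to the 3-D distance spanned by
    each skeleton edge -- the compact pose description the article
    says the system needs, rather than the body's entire mass."""
    def dist(a, b):
        return math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(3)))
    return {edge: dist(joints[edge[0]], joints[edge[1]])
            for edge in SKELETON_EDGES}
```

Because the spacings, not the raw silhouette, describe a pose, the same compact representation can be re-mapped onto different body models, much as the researchers mapped each pose to 12 ages, genders and body types.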
The end result was a huge database consisting of frames of video with people’s joints marked. Twenty percent of the data was used to train the system’s brain to recognize movements. Engineers are keeping the rest in a “ground truth” database used to test Natal’s accuracy. The better the system can recognize gestures, the more fun it will be to play the game.
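The 20/80 division the article describes is a standard train/test split: one portion of the labeled frames teaches the recognizer, while the held-out remainder measures its accuracy. A minimal sketch of such a split, with the function name and fixed seed as illustrative assumptions:

```python
import random

def split_frames(frames, train_fraction=0.2, seed=0):
    """Shuffle labeled frames, then split them: the first fraction
    trains the recognizer, the rest serves as the held-out
    "ground truth" set used to test accuracy."""
    shuffled = frames[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Keeping the ground-truth frames out of training is what makes the accuracy measurement honest: the system is graded on gestures it has never seen.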
Of course, Microsoft is not the only company exploring gestural interfaces. Last May, Sony demonstrated a prototype unit that relies on stereo video cameras and depth sensors that, it says, could be used to control a computer cursor, game avatar or even a robot. Canesta, a company that makes computer-vision hardware, has demonstrated a system that lets couch potatoes control the TV with a wave of the hand and has partnered with computer manufacturers Hitachi and GestureTek to create gestural controls for PC applications.