When Nintendo's Wii game console debuted in November 2006, its motion-sensing handheld "Wiimotes" got players off the couch and onto their feet. Now Microsoft is trying to outdo its competitor by eliminating the controller altogether: It has revealed details of how it developed Project Natal, which gives Xbox 360 players the ability to manipulate on-screen characters via natural body movements.
The machine-learning technology will enable players to do things such as kick a digital soccer ball or swat a handball in their living rooms simply by mimicking the motion. "Instead of a controller, your body becomes the game input," says Alex Kipman, Microsoft's director of incubation for Xbox 360.
Microsoft introduced its ambitious Xbox upgrade in June 2009 and expects to ship the technology in time for the year-end 2010 holiday season. Natal will consist of a depth sensor that uses infrared signals to create a digital 3-D model of a player's body as it moves, a video camera that can pick up fine details such as facial expressions, and a microphone that can identify and locate individual voices.
Programming a game system to discern the human body's almost limitless combinations of joint positions is a fearsome computational problem. "Every single motion of the body is an input, so you'd need to program near infinite reactions to actions," Kipman says.
Instead of trying to preprogram actions, Microsoft decided to teach its gaming technology to recognize gestures in real time just as a human does: by extrapolating from experience. Jamie Shotton, a researcher at Microsoft Research Cambridge in England, devised a machine-learning algorithm for that purpose. The algorithm also recognizes poses and renders them in the game space on-screen at 30 frames per second, a rate that conveys smooth movement. Essentially, Natal-enhanced Xboxes will do motion capture on the fly, without the need for the mirror-studded spandex suit of conventional motion-capture approaches.
Training Natal for this task required Microsoft to amass a large amount of biometric data. The firm sent observers to homes around the globe, where they videotaped basic motions such as turning a steering wheel or catching a ball, Kipman says. Microsoft researchers later laboriously selected key frames within this footage and marked each joint on each person's body. Kipman and his team also went into a Hollywood motion-capture studio to gather data on more acrobatic movements.
"During training, we need to provide the algorithm with two things: realistic-looking images that are synthesized and, for each pixel, the corresponding part of the body," Shotton says. The algorithm processes the data and adjusts the values of its internal parameters to achieve the best performance.
To keep the amount of data manageable, the team needed to figure out which elements were most relevant for training. For example, the system doesn't need to recognize the entire body mass, but only the spacing of skeletal joints. After whittling down the data to the essential motions, the researchers mapped each unique pose to 12 models representing different ages, genders and body types.
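For readers who want a concrete picture, the training recipe Shotton describes (synthesized images with a body-part label on every pixel) and the focus on joints rather than whole bodies can be sketched in a few lines of Python. This is a toy illustration, not Microsoft's algorithm, which the article does not detail: the stick-figure labeling rule, the nearest-prototype classifier and every function name here are assumptions made purely for the example.

```python
from collections import defaultdict

def body_part(row_frac, col_frac):
    """Ground-truth label for a pixel of a crude synthetic figure (hypothetical rule)."""
    if row_frac < 0.2:
        return "head"
    if row_frac >= 0.6:
        return "legs"
    if col_frac < 0.25:
        return "left_arm"
    if col_frac > 0.75:
        return "right_arm"
    return "torso"

def synth_image(height, width):
    """Synthesize a labeled image: ((row_frac, col_frac), label) for every pixel."""
    return [
        ((r / (height - 1), c / (width - 1)),
         body_part(r / (height - 1), c / (width - 1)))
        for r in range(height)
        for c in range(width)
    ]

def centroids(labeled_points):
    """Average the positions of the points that share a label."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (y, x), label in labeled_points:
        total = sums[label]
        total[0] += y
        total[1] += x
        total[2] += 1
    return {label: (t[0] / t[2], t[1] / t[2]) for label, t in sums.items()}

def train(images):
    """Learn one prototype position per body part from all training pixels."""
    return centroids([pixel for image in images for pixel in image])

def classify(model, row_frac, col_frac):
    """Assign a pixel to the body part with the nearest learned prototype."""
    return min(
        model,
        key=lambda part: (model[part][0] - row_frac) ** 2
        + (model[part][1] - col_frac) ** 2,
    )

def estimate_joints(model, height, width):
    """Classify every pixel of a new figure, then place one joint per body
    part at the center of the pixels assigned to that part."""
    predicted = []
    for r in range(height):
        for c in range(width):
            point = (r / (height - 1), c / (width - 1))
            predicted.append((point, classify(model, *point)))
    return centroids(predicted)

# Train on two synthetic figures of different sizes, then process an unseen size.
model = train([synth_image(11, 9), synth_image(21, 17)])
joints = estimate_joints(model, 15, 13)
for part, (y, x) in sorted(joints.items()):
    print(f"{part}: row={y:.2f}, col={x:.2f}")
```

The sketch reduces a whole image to five points, which mirrors the idea that the system needs the spacing of joints rather than the full body. A real system would work from depth data and a far richer set of poses and body types than this stick figure.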