A young child can look at whatever is in front of them and describe what they see, but for artificial intelligence systems that is a daunting task, because it combines two separate skills: recognizing the objects in a scene and generating sentences that describe it. Scientists at the University of Toronto and the University of Montreal have developed software, modeled on networks of brain cells, that they claim can take any image, generate a caption, and get it right most of the time.

Their approach builds on earlier work in natural language processing: extracting meaning from words and sentences and, in particular, converting speech or text from one language to another. "It's the combination of image information with natural language," says Richard Zemel, a computer scientist at the University of Toronto. "That's what's new here—the marriage of image and text. We think of it as a translation problem," he notes. "When you're trying to translate a sentence from English to French, you have to get the meaning of the sentence in English first and then convert it to French. Here, you need the meaning, the content of the image; then you can translate it into text."
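The two-stage "translation" idea Zemel describes can be caricatured in a few lines of code. This is only a toy sketch: the functions, lookup table, and numbers below are invented for illustration, and the real system uses learned neural networks rather than hand-built dictionaries. The point is just the pipeline: first compress the input into a representation of its content, then decode that representation into words.

```python
# Toy two-stage pipeline: summarize the "image" into a feature (its
# "meaning"), then decode that feature into text. Everything here is a
# hypothetical stand-in for the learned encoder and decoder networks.

def encode(image_pixels):
    """Stand-in encoder: reduce an 'image' (a list of brightness values)
    to a single crude summary feature."""
    return sum(image_pixels) / len(image_pixels)

def decode(feature, vocab):
    """Stand-in decoder: pick the caption whose stored feature value
    lies closest to the encoded input."""
    return min(vocab, key=lambda caption: abs(vocab[caption] - feature))

# Hypothetical "vocabulary" mapping captions to typical feature values.
vocab = {"a dark scene": 0.1, "a bright beach": 0.9}

caption = decode(encode([0.8, 0.95, 0.85]), vocab)
print(caption)  # prints "a bright beach": the feature ~0.87 is closest to 0.9
```

In the real model both stages are learned from data, and the decoder emits a sentence word by word rather than choosing from a fixed list, but the encode-then-decode structure is the same.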

But how does the software model "know" what's in the image in the first place? Before the system can process an unfamiliar picture, it is trained on a massive data set—actually three different data sets containing, in all, more than 120,000 already captioned images. The model also needs some sense of which words are likely to appear alongside others in ordinary English sentences. For example, an image that causes the model to generate the word "boat" is likely to also yield the word "water," because those words usually go together. Moreover, the model has some idea of what's important in an image: Zemel points out, for example, that if an image has a person in it, the model tends to mention that in the caption.
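The word co-occurrence idea can be illustrated with a toy word-pair counter. The handful of training captions below are made up for the example; the real model learns these statistics, along with much richer ones, from its 120,000-plus captioned images.

```python
# Toy co-occurrence model: count which word follows which in a few
# (invented) training captions, then ask for the likeliest continuation.
from collections import Counter, defaultdict

captions = [
    "a boat on the water",
    "a boat sails on the water",
    "the boat on the lake",
    "a dog runs in the park",
]

# follows[w] counts the words observed immediately after w.
follows = defaultdict(Counter)
for caption in captions:
    words = caption.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

# After "boat", which word is most likely to come next?
print(follows["boat"].most_common(1)[0][0])  # prints "on"
```

Even this crude tally captures the article's point: once "boat" has been generated, words like "on" and, further along, "water" become far more probable than unrelated ones.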

Often, the results are dead-on. For one image, it generated the caption, "A stop sign on a road with a mountain in the background"—just as the image showed; it was also accurate for "A woman is throwing a Frisbee in a park" and "A giraffe standing in a forest with trees in the background." But occasionally it stumbles. When an image contained two giraffes near one another but far from the camera, it identified them as "a large white bird." And a vendor behind a vegetable stand yielded the caption, "A woman is sitting at a table with a large pizza." Sometimes similar-looking objects are simply mistaken for one another—a sandwich wrapped in tinfoil can be misidentified as a cell phone, for example (especially if someone is holding it near their face). In their tests, Zemel says, the model came up with a caption that "could be mistaken for that of a human" about 70 percent of the time.

One potential application might be as an aid for the visually impaired, Zemel says. A blind person might snap a photo of whatever’s in front of them and ask the system to produce a sentence describing the scene. It could also help with the otherwise-laborious task of labeling images. (A media outlet might want to instantly locate all archival images of, say, children playing hockey or cars being assembled in a factory—a daunting task if the thousands of images on one’s hard drive haven’t been labeled.)

Credit: Microsoft COCO/Kelvin Xu et al.

Is the model thinking? “There are analogies to be made between what the model is doing and what brains are doing,” Zemel says, particularly in terms of representing the outside world and in devoting “attention” to specific parts of a scene. “It’s getting toward what we’re trying to achieve, which is to get a machine to be able to construct a representation of our everyday world in a way which is reflective of understanding.”

Zemel and his colleagues will be presenting a paper on the work, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” at the International Conference on Machine Learning in July.