In the 19th century, French clinician Guillaume-Benjamin-Amand Duchenne posited that humans universally use their facial muscles to make at least 60 discrete expressions, each reflecting one of 60 specific emotions. Charles Darwin, who greeted that number with some skepticism, was invested in exploring the universality of facial expressions as evidence of humanity’s common evolutionary history. He ended up writing a book about human expressions, leaning heavily toward the idea that at least some were common across all cultures.

Since these early forays into the field, debate has run hot over whether some faces we make are common to all of us and, if so, how many there are. Duchenne settled on 60, whereas beginning in the 1970s, psychologist Paul Ekman famously characterized six (disgust, sadness, happiness, fear, anger, surprise) in a construct that has held sway for decades.

A new study published on December 16 in Nature takes things a step further and lands on yet another tally of universal facial expressions, this time based on millions of moving images instead of the small number of still photographs employed in earlier studies. Using human ratings of 186,744 videos showing people reacting in different situations, the authors trained a neural network to label facial expressions from a list of emotion labels, such as awe, confusion and anger. With this training, the neural network assessed six million videos originating from 144 countries and consistently linked similar facial expressions to similar social contexts across 12 regions of the globe. For example, a facial reaction labeled “triumph” was often associated with sporting events regardless of geographical region, suggesting a universal response in that context.

Although the results imply that how we move our faces in certain situations may be common across cultures, they do not address whether these expressions accurately signal the internal experience of an emotion. Several factors that could have affected the results also raise concerns: the limitations of machine learning, the use of only English-speaking raters in India for training the algorithm and the potential for some misinterpretations of the findings.

Using video and considering context is “absolutely a step forward” for the field, says Lisa Feldman Barrett, a professor and psychologist at Northeastern University, who wrote an accompanying commentary on the study. “The question that they’re asking strikes right at the heart of the nature of emotion,” she remarks. But there’s a risk that such information might be used to judge people, which “would be premature because there are multiple ways to interpret these results.”

The study’s first author Alan Cowen, a researcher at the University of California, Berkeley, and a visiting faculty researcher at Google, agrees, saying the use of machine learning for studying the physiology of emotion has just begun. “It’s early days for sure, and this is pretty nascent,” he says. “We’re just focused on whether and how machine learning can help researchers answer important questions about human emotions.”

The machine doing the learning in this case was a deep neural network, which takes input, such as a video clip, and parses it through many layers to predict what the input material contains. In this case, the neural network tracked the movement of faces in videos and labeled the facial expressions in different social situations. But first, it had to learn to apply various labels that human raters associated with specific facial configurations.

To teach the network, Cowen and his colleagues needed a large repository of videos rated by human viewers. A team of English-speaking raters in India completed this task, making 273,599 ratings of 186,744 YouTube video clips lasting one to three seconds each. The research team used the results to train the neural network to classify patterns of facial movement with one of 16 emotion-related labels, such as pain, doubt or surprise.

The scientists then had another neural network analyze visual cues in three million videos from 144 countries to assign a social context to each video, from weddings to weight lifting to witnessing fireworks, ultimately characterizing 653 contexts.

They then tested the facial expression network on these three million videos, assessing how consistently it assigned specific facial expression labels in similar social contexts, such as “joy” when seeing a toy. The results showed a similar pattern of associations across 12 regions around the globe. For example, regardless of region, the neural network tended to associate facial expressions labeled “amusement” most often with contexts labeled “practical joke,” and the “pain” expression label was consistently associated with uncomfortable contexts such as weight lifting.

To rule out an influence of faces in those three million videos on the assignment of social context, the researchers had another network assign context using only the words in labels and descriptions accompanying videos. That network took on another three million videos and assigned 1,953 social contexts. When the facial expression network applied the 16 labels to these videos, there was a similar but slightly weaker consistency between the label for a facial expression and the context assigned to the video. Cowen said that this outcome was expected because context from language is a lot less accurate than context assignments based on video, “which starts to illustrate what happens when you rely too much on language.”

When they compared results across regions, Cowen and his colleagues found some regional variation, with geographically adjacent regions tending to resemble each other more closely. Africa, for example, was most similar to the nearby Middle East in its expression-context associations and less similar to the more distant India.

Yet on average, each individual region was similar to the average across all 12 regions, Cowen says—often more so than to any of its direct neighbors.
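
For readers who want a concrete picture of this kind of regional comparison, here is a minimal Python sketch. It is not the authors' code or their statistical method. It assumes a hypothetical list of (region, context, expression) records, one per labeled video, and uses a plain Pearson correlation as a stand-in similarity measure. The function names are invented for illustration, and comparing each region with the average of the other regions (rather than of all regions) is a choice made here so that a region is not compared with itself.

```python
from collections import defaultdict
from math import sqrt


def association_rates(records):
    """Share of each region's videos in a context that got each expression label.

    `records` is a hypothetical iterable of (region, context, expression) tuples.
    Returns {region: {(context, expression): rate}}.
    """
    pair_counts = defaultdict(lambda: defaultdict(int))
    context_counts = defaultdict(lambda: defaultdict(int))
    for region, context, expression in records:
        pair_counts[region][(context, expression)] += 1
        context_counts[region][context] += 1
    return {
        region: {
            (ctx, expr): n / context_counts[region][ctx]
            for (ctx, expr), n in pairs.items()
        }
        for region, pairs in pair_counts.items()
    }


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def similarity_to_average(rates):
    """Correlate each region's rates with the mean rates of all other regions."""
    if len(rates) < 2:
        raise ValueError("need at least two regions to compare")
    # Use every (context, expression) pair seen anywhere; missing pairs count as 0.
    keys = sorted({key for region_rates in rates.values() for key in region_rates})
    result = {}
    for region, own in rates.items():
        others = [r for name, r in rates.items() if name != region]
        average = [sum(o.get(k, 0.0) for o in others) / len(others) for k in keys]
        result[region] = pearson([own.get(k, 0.0) for k in keys], average)
    return result
```

Calling `similarity_to_average(association_rates(records))` returns one correlation per region. Values near 1 would mean a region pairs expressions with contexts in much the same way as the rest of the world does in this toy setup.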

The caveat is that these facial expressions do not provide a readout of emotion or intention, asserts Jeffrey Cohn, a professor of psychology at the University of Pittsburgh, who also peer-reviewed the paper. “They’re related in context but that’s a long way from making inferences about the meaning of any particular expression. There is no one-to-one mapping between facial expression and emotion.”

Cowen confirmed that “we don’t know what somebody’s feeling based on their facial muscle movements, and we’re not claiming to infer that.” For example, “the same facial expression used in sporting events in one culture versus another might be associated with a more positive or more negative emotional experience.”

Machine learning is a powerful technique, Barrett acknowledges, but it has to be used with caution. “You have to be careful not to encode the beliefs of the human raters in these models and where beliefs will influence training,” she says. “No matter how fancy schmancy the modeling is, it’s not going to protect you from the weaknesses that creep in when there’s human inference involved.” She notes, for example, that the human raters of the videos used to train the facial expression algorithm were all from the same culture and region and constrained to using a list of labels that themselves are emotion words, such as “anger,” rather than descriptions, such as “scowl.”

Casey Fiesler, an assistant professor of information science at the University of Colorado Boulder, who was not involved in the work, expressed concern about the possibility of bias from applying the four categories of race that the authors used to evaluate the influence of that factor. “There’s a huge body of literature that speaks to, for example, implicit bias in judging facial expressions of people of different races,” she pointed out.

Used in the wrong ways, assumptions about the universality of facial expression can cause harm to people who are marginalized or vulnerable economically, says Nazanin Andalibi, assistant professor in the school of information science at the University of Michigan. As an example from earlier facial-recognition applications, she says, “no matter how much someone smiles, some algorithms continue to associate negative emotions with Black faces, so there are a lot of individual-level harms.”

The technology also offers some potential benefits, Cohn says, such as recognizing facial expression cues in people at risk for suicide. The relevance of context in this work is a major step in that direction, he adds. “I am not saying we can go out on the street and detect whether someone is depressed,” Cohn says, “but within a specific context like a clinical interview, we could measure severity of depression.”

Efforts to use the technology moving forward—for suicide prevention or other uses—require attention to the various real-world pitfalls that are possible when an algorithm makes a judgment about people. “Machine-learning techniques are cool and really useful, but they are not a magic bullet,” Barrett says. “This is not just an esoteric debate within the walls of the academy or the walls of Google.”