A wealth of data suggests that averaging the answers of many people often outperforms any individual opinion, even an expert's. The wisdom of the crowd is far from infallible, though: when specialized knowledge is needed, even if the crowd contains experts, they are drowned out by majority ignorance. But a study published today in Nature, led by behavioral economist Drazen Prelec of the Massachusetts Institute of Technology, presents a new method that can extract correct answers from a crowd even when the majority opinion is wrong.

The most frequently quoted example of the crowd wisdom phenomenon comes from a 1987 study in which researchers asked 56 students to estimate the number of jelly beans in a jar. The average of the guesses (871) was closer to the true number (850) than all but one of the individual guesses. This approach doesn’t work in all cases, however.

Previous research aimed at improving accuracy has often involved obtaining confidence ratings. Giving more weight to higher confidence answers can increase accuracy, but still fails in some situations, such as when deliberately misleading questions are used. For example, this new study shows that, when asked if Philadelphia is the capital of Pennsylvania, most people incorrectly answer “yes” because they know it is a large, historically significant city in Pennsylvania, even though the correct answer is Harrisburg. Confidence ratings don't solve this problem, as people are often as confident in an incorrect answer as in a correct one. “Conceptually there's something missing in confidence,” Prelec says. “You want people to express whether their information draws on common knowledge or not—it's really how confident they are that they have unique information.”
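To see why confidence weighting fails here, consider a minimal sketch of confidence-weighted voting. The function name and the vote data below are invented for illustration and are not from the study; the point is that a confidently wrong majority still outweighs a correct minority.

```python
# Minimal sketch of confidence-weighted voting on a yes/no question.
# All names and data here are hypothetical, not taken from the study.

def confidence_weighted_vote(answers, confidences):
    """answers: list of 'yes'/'no' responses.
    confidences: one weight in [0, 1] per respondent.
    Returns the answer with the larger total confidence-weighted support."""
    weight = {"yes": 0.0, "no": 0.0}
    for answer, conf in zip(answers, confidences):
        weight[answer] += conf
    return max(weight, key=weight.get)

# "Is Philadelphia the capital of Pennsylvania?" -- the wrong majority
# is just as confident as the correct minority, so weighting by
# confidence still returns the wrong answer.
answers = ["yes"] * 7 + ["no"] * 3
confidences = [0.9] * 7 + [0.9] * 3
print(confidence_weighted_vote(answers, confidences))  # -> yes
```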

The team devised a clever yet simple solution they call the “surprisingly popular” method. In addition to providing answers and confidence ratings, they asked participants to predict how others would respond. They show that selecting the answer that is more popular than predicted outperforms both “most popular” and “most confident” methods. Both the misguided majority and the correct minority predict that most people will give the incorrect response, so the minority (but correct) response is given much more often than predicted. “Minorities can be wildly off-base, but there are many situations where you have a hierarchy of knowledge, and the people with more knowledge often know other people won't share their information,” Prelec explains. “In the majority of cases the expert knows what the non-expert will think.” Selecting the “surprisingly popular” response is thus more accurate in these situations. “The genius [of this approach] is you let a more knowledgeable minority reveal itself through predictions that the majority of people will disagree with them,” says Stefan Herzog, a researcher in decision-making at the Max Planck Institute for Human Development in Berlin, who was not involved in the research.
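For a binary question, the selection rule can be sketched in a few lines: an answer is “surprisingly popular” if its actual share of votes exceeds the crowd's average predicted share. The function name and the numbers below are hypothetical, not drawn from the paper.

```python
# Sketch of the "surprisingly popular" rule for a binary question.
# Function name and data are invented for this illustration.

def surprisingly_popular(votes, predicted_yes_shares):
    """votes: list of 'yes'/'no' answers.
    predicted_yes_shares: each respondent's predicted fraction of 'yes' votes.
    Returns the answer whose actual share most exceeds its predicted share."""
    actual_yes = votes.count("yes") / len(votes)
    predicted_yes = sum(predicted_yes_shares) / len(predicted_yes_shares)
    # If "yes" beats its prediction, it is surprisingly popular; otherwise
    # "no" is, since the two shares are complementary.
    return "yes" if actual_yes > predicted_yes else "no"

# Philadelphia example: 60% vote "yes" (wrong), but both groups predict
# around 80% will say "yes" -- so "no" outperforms its prediction and wins.
votes = ["yes"] * 6 + ["no"] * 4
predictions = [0.8, 0.9, 0.7, 0.85, 0.8, 0.75, 0.9, 0.8, 0.85, 0.9]
print(surprisingly_popular(votes, predictions))  # -> no
```

Note how the majority answer (“yes”) loses: the correct minority anticipated the majority's error, and that anticipation is what the predictions reveal.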

The study focused on binary yes-or-no questions, in four different settings. The first experiment included 50 questions about US state capitals. The second used 80 true-or-false trivia questions, selected to include both questions most would answer correctly and those the majority would get wrong. The third presented dermatologists with a selection of 80 images of skin lesions and asked them both to rate their confidence about whether each was benign or malignant and to predict the distribution of other dermatologists' judgments. The final experiment asked both a group of art experts and a group of MIT students who had not taken art courses to judge the market value of 90 reproductions of 20th-century works of art. Four value ranges were given, and participants were also asked to estimate the percentage of people predicting a value above $30,000.

In all cases the new method performed better than majority or confidence-based methods alone, reducing errors by between 21 and 35 percent. “There's one key idea here, which is to ask people how many people they think would agree with them,” says cognitive scientist Michael Lee of the University of California, Irvine, who was also not involved in the work. “That seems a clever way to do things and the results are pretty compelling.” In the experiment involving dermatologists, although the new method performed best, the difference was not statistically significant, most likely because all the participants were experts, reducing the crowd’s range of knowledge. “We wanted to go into more interesting domains, and make the challenge hard,” Prelec notes.

The “wisdom of crowds” is usually understood as a statistical rather than psychological phenomenon, often explained by analogy to physical systems involving a noisy signal. Answers are disturbed from the “signal” (truth) by unrelated (statistically independent) errors, which thus cancel out when averaged, so the mean is close to accurate. Some researchers have even found that if participants are allowed to communicate, it reduces crowd performance, presumably because errors are then no longer unrelated.
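A toy example makes the cancellation concrete. The numbers below are invented (loosely modeled on the jelly-bean study): errors of opposite sign largely cancel in the mean, so the average lands far closer to the truth than the typical individual guess.

```python
# Toy illustration of the signal-plus-noise view of crowd wisdom.
# The "truth" and the individual errors are invented for this sketch.
truth = 850  # the true value (the "signal")

# Roughly independent errors scatter guesses on both sides of the truth.
errors = [-300, -150, -60, -20, 10, 40, 90, 180, 250, -30]
guesses = [truth + e for e in errors]

# Opposite-signed errors largely cancel when averaged...
mean_guess = sum(guesses) / len(guesses)
# ...while the typical individual guess remains far from the truth.
mean_abs_error = sum(abs(g - truth) for g in guesses) / len(guesses)

print(mean_guess)      # -> 851.0, only 1 away from the truth
print(mean_abs_error)  # -> 113.0, the typical individual is ~113 away
```

This is also why communication can hurt: once participants influence one another, their errors correlate and no longer cancel.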

But this metaphor overlooks the fact that the “system” is composed of people. “The other model is that what's happening is not a question of noise, it's a question of there being some bits of evidence that are widely known and others that are concentrated in small groups—that moves away from physical science notions, to a cultural one,” Prelec says. “A lot of statistics in crowd methods have treated people like physical particles, but we're asking the crowd to reflect on what it knows. That's not something particles can do.” Lee's group approaches crowd wisdom research as a cognitive modeling problem. “The data are produced by people, [and are] critically sensitive to things like individual differences and expertise, which are fundamentally psychological concepts,” Lee says. “Something I'd like to see next is, Can you improve this further by [an] understanding of how people produce these kinds of judgments?”

The study also includes theoretical analyses that extend the method to multiple-choice situations, but whether it works in more complex settings, such as estimates or rank orderings, remains an open question. “This [approach] could potentially apply to all sorts of human judgments,” Lee says. “There's now a lot of work to do to see how powerful and general it is.”

The work may have immediate real-world applications. Herzog worked on a study published last year using “collective intelligence” to improve breast and skin cancer diagnoses. “It could be applied in the emerging field of teledermatology, by combining the opinions of multiple diagnosticians,” he says. “In principle, it could be applied everywhere we use majority voting, where it's feasible to ask people not only for their own decision, but also what proportion of people they think will agree.”

The longer-term goal is to be able to produce good estimates for questions without known, well-defined answers. “The real test of this would be some question like who's going to win the US presidential election, or a sports match; I'd be keen to see how this performs,” Lee says. “It's an interesting open question, whether they're fundamentally different, or just more challenging.” Prelec is optimistic: “The assumption has been these are two different categories of question, but the reasoning we do is very similar in the two domains,” he says. “The strategy is to fine-tune your method on questions where you can verify the answer, then make the leap of faith and assume this is the best you can do with non-verifiable questions.”