In schools across the U.S., multiple-choice questions such as this one provoke anxiety, even dread. Their appearance means it is testing time, and tests are big, important, excruciatingly unpleasant events.

But not at Columbia Middle School in Illinois, in the classroom of eighth grade history teacher Patrice Bain. Bain has lively blue eyes, a quick smile, and spiky platinum hair that looks punkish and pixieish at the same time. After displaying the question on a smartboard, she pauses as her students enter their responses on numbered devices known as clickers.

“Okay, has everyone put in their answers?” she asks. “Number 19, we're waiting on you!” Hurriedly, 19 punches in a selection, and together Bain and her students look over the class's responses, now displayed at the bottom of the smartboard screen. “Most of you got it—John Glenn—very nice.” She chuckles and shakes her head at the answer three of her students have submitted. “Oh, my darlings,” says Bain in playful reproach. “Khrushchev was not an astronaut!”

Bain moves on to the next question, briskly repeating the process of asking, answering and explaining as she and her students work through the decade of the 1960s.

When every student gives the correct answer, the class members raise their hands and wiggle their fingers in unison, an exuberant gesture they call “spirit fingers.” This is the case with the Bay of Pigs question: every student nails it.

“All right!” Bain enthuses. “That's our fifth spirit fingers today!”

The banter in Bain's classroom is a world away from the tense standoffs at public schools around the country. Since the enactment of No Child Left Behind in 2002, parents' and teachers' opposition to the law's mandate to test “every child, every year” in grades three through eight has been intensifying. A growing number of parents are withdrawing their children from the annual state tests; the epicenter of the “opt-out” movement may be New York State, where as many as 90 percent of students in some districts reportedly refused to take the year-end examination last spring. Critics of U.S. schools' heavy emphasis on testing charge that the high-stakes assessments inflict anxiety on students and teachers, turning classrooms into test-preparation factories instead of laboratories of genuine, meaningful learning.

In the always polarizing debate over how American students should be educated, testing has become the most controversial issue of all. Yet a crucial piece has been largely missing from the discussion so far. Research in cognitive science and psychology shows that testing, done right, can be an exceptionally effective way to learn. Taking tests, as well as engaging in well-designed activities before and after tests, can produce better recall of facts—and deeper and more complex understanding—than an education without exams. But a testing regime that actively supports learning, in addition to simply assessing, would look very different from the way American schools “do” testing today.

What Bain is doing in her classroom is called retrieval practice. The practice has a well-established base of empirical support in the academic literature, going back almost 100 years—but Bain, unaware of this research, worked out something very similar on her own over the course of a 21-year career in the classroom.

“I've been told I'm a wonderful teacher, which is nice to hear, but at the same time I feel the need to tell people: ‘No, it's not me—it's the method,’” says Bain in an interview after her class has ended. “I felt my way into this approach, and I've seen it work such wonders that I want to get up on a mountaintop and shout so everyone can hear me: ‘You should be doing this, too!’ But it's been hard to persuade other teachers to try it.”

Then, eight years ago, she met Mark McDaniel through a mutual acquaintance. McDaniel is a psychology professor at Washington University in St. Louis, a half-hour's drive from Bain's school. McDaniel had started to describe to Bain his research on retrieval practice when she broke in with an exclamation. “Patrice said, ‘I do that in my classroom! It works!’” McDaniel recalls. He went on to explain to Bain that what he and his colleagues refer to as retrieval practice is, essentially, testing. “We used to call it ‘the testing effect’ until we got smart and realized that no teacher or parent would want to touch a technique that had the word ‘test’ in it,” McDaniel notes now.

Retrieval practice does not use testing as a tool of assessment. Rather it treats tests as occasions for learning, which makes sense only once we recognize that we have misunderstood the nature of testing. We think of tests as a kind of dipstick that we insert into a student's head, an indicator that tells us how high the level of knowledge has risen in there—when in fact, every time a student calls up knowledge from memory, that memory changes. Its mental representation becomes stronger, more stable and more accessible.

Why would this be? It makes sense considering that we could not possibly remember everything we encounter, says Jeffrey Karpicke, a professor of cognitive psychology at Purdue University. Given that our memory is necessarily selective, the usefulness of a fact or idea—as demonstrated by how often we have had reason to recall it—makes a sound basis for selection. “Our minds are sensitive to the likelihood that we'll need knowledge at a future time, and if we retrieve a piece of information now, there's a good chance we'll need it again,” Karpicke explains. “The process of retrieving a memory alters that memory in anticipation of demands we may encounter in the future.”

Studies employing functional magnetic resonance imaging of the brain are beginning to reveal the neural mechanisms behind the testing effect. In the handful of studies that have been conducted so far, scientists have found that calling up information from memory, as compared with simply restudying it, produces higher levels of activity in particular areas of the brain. These brain regions are associated with the so-called consolidation, or stabilization, of memories and with the generation of cues that make memories readily accessible later on. Across several studies, researchers have demonstrated that the more active these regions are during an initial learning session, the more successful study participants' recall is weeks or months later.

According to Karpicke, retrieving is the principal way learning happens. “Recalling information we've already stored in memory is a more powerful learning event than storing that information in the first place,” he says. “Retrieval is ultimately the process that makes new memories stick.” Not only does retrieval practice help students remember the specific information they retrieved, it also improves retention for related information that was not directly tested. Researchers theorize that while sifting through our mind for the particular piece of information we are trying to recollect, we call up associated memories and in so doing strengthen them as well. Retrieval practice also helps to prevent students from confusing the material they are currently learning with material they learned previously and even appears to prepare students' minds to absorb the material still more thoroughly when they encounter it again after testing (a phenomenon researchers call “test-potentiated learning”).

Hundreds of studies have demonstrated that retrieval practice is better at improving retention than just about any other method learners could use. To cite one example: in a study published in 2008 by Karpicke and his mentor, Henry Roediger III of Washington University, the authors reported that students who quizzed themselves on vocabulary terms remembered 80 percent of the words later on, whereas students who studied the words by repeatedly reading them over remembered only about a third of the words. Retrieval practice is especially powerful compared with students' most favored study strategies: highlighting and rereading their notes and textbooks, practices that a recent review found to be among the least effective.

And testing does not merely enhance the recall of isolated facts. The process of pulling up information from memory also fosters what researchers call deep learning. Students engaging in deep learning are able to draw inferences from, and make connections among, the facts they know and are able to apply their knowledge in varied contexts (a process learning scientists refer to as transfer). In an article published in 2011 in the journal Science, Karpicke and his Purdue colleague Janell Blunt explicitly compared retrieval practice with a study technique known as concept mapping. An activity favored by many teachers as a way to promote deep learning, concept mapping asks students to draw a diagram that depicts the body of knowledge they are learning, with the relations among concepts represented by links among nodes, like roads linking cities on a map.

In their study, Karpicke and Blunt directed groups of undergraduate volunteers—200 in all—to read a passage taken from a science textbook. One group was then asked to create a concept map while referring to the text; another group was asked to recall, from memory, as much information as they could from the text they had just read. On a test given to all the students a week later, the retrieval-practice group was better able to recall the concepts presented in the text than the concept-mapping group. More striking, the former group was also better able to draw inferences and make connections among multiple concepts contained in the text. Overall, Karpicke and Blunt concluded, retrieval practice was about 50 percent more effective at promoting both factual and deep learning.

Transfer—the ability to take knowledge learned in one context and apply it to another—is the ultimate goal of deep learning. In an article published in 2010, University of Texas at Austin psychologist Andrew Butler demonstrated that retrieval practice promotes transfer better than the conventional approach of studying by rereading. In Butler's experiment, students engaged either in rereading or in retrieval practice after reading a text that pertained to one “knowledge domain”—in this case, bats' use of sound waves to find their way around. A week later the students were asked to transfer what they had learned about bats to a second knowledge domain: the navigational use of sound waves by submarines. Students who had quizzed themselves on the original text about bats were better able to transfer their bat learning to submarines.

Robust though such findings are, they were until recently almost exclusively made in the laboratory, with college students as subjects. McDaniel had long wanted to apply retrieval practice in real-world schools, but gaining access to K–12 classrooms was a challenge. With Bain's help, McDaniel and two of his Washington University colleagues, Roediger and Kathleen McDermott, set up a randomized controlled trial at Columbia Middle School that ultimately involved nine teachers and more than 1,400 students. During the course of the experiment, sixth, seventh and eighth graders learned about science and social studies in one of two ways: 1) material was presented once, then teachers reviewed it with students three times; 2) material was presented once, and students were quizzed on it three times (using clickers like the ones in Bain's current classroom).

When the results of students' regular unit tests were calculated, the difference between the two approaches was clear: students earned an average grade of C+ on material that had been reviewed and A− on material that had been quizzed. On a follow-up test administered eight months later, students still remembered the information they had been quizzed on much better than the information they had reviewed.

“I had always thought of tests as a way to assess—not as a way to learn—so initially I was skeptical,” says Andria Matzenbacher, a former teacher at Columbia who now works as an instructional designer. “But I was blown away by the difference retrieval practice made in the students' performance.” Bain, for one, was not surprised. “I knew that this method works, but it was good to see it proven scientifically,” she says. McDaniel, Roediger and McDermott eventually extended the study to nearby Columbia High School, where quizzing generated similarly impressive results. In an effort to make retrieval practice a common strategy in classrooms across the country, the Washington University team (with the help of research associate Pooja K. Agarwal, now at Harvard University) developed a manual for teachers, How to Use Retrieval Practice to Improve Learning.

Even with the weight of evidence behind them, however, advocates of retrieval practice must still contend with a reflexively negative reaction to testing among many teachers and parents. They also encounter a more thoughtful objection, which goes something like this: American students are tested so much already—far more often than students in other countries, such as Finland and Singapore, which regularly place well ahead of the U.S. in international evaluations. If testing is such a great way to learn, why aren't our students doing better?

Marsha Lovett has a ready answer to that question. Lovett, director of the Eberly Center for Teaching Excellence and Educational Innovation at Carnegie Mellon University, is an expert on “metacognition”—the capacity to think about our own learning, to be aware of what we know and do not know, and to use that awareness to effectively manage the learning process.

Yes, Lovett says, American students take a lot of tests. It is what happens afterward—or more precisely, what does not happen—that causes these tests to fail to function as learning opportunities. Students often receive little information about what they got right and what they got wrong. “That kind of item-by-item feedback is essential to learning, and we're throwing that learning opportunity away,” she says. In addition, students are rarely prompted to reflect in a big-picture way on their preparation for, and performance on, the test. “Often students just glance at the grade and then stuff the test away somewhere and never look at it again,” Lovett says. “Again, that's a really important learning opportunity that we're letting go to waste.”

A few years ago Lovett came up with a way to get students to engage in reflection after a test. She calls it an “exam wrapper.” When the instructor hands back a graded test to a student, along with it comes a piece of paper literally wrapped around the test itself. On this paper is a list of questions: a short exercise that students are expected to complete and hand in. The wrapper that Lovett designed for a math exam includes such questions as:

Based on the estimates above, what will you do differently in preparing for the next test? For example, will you change your study habits or try to sharpen specific skills? Please be specific. Also, what can we do to help?

The idea, Lovett says, is to get students thinking about what they did not know or did not understand, why they failed to grasp this information and how they could prepare more effectively in advance of the next test. Lovett has been promoting the use of exam wrappers to the Carnegie Mellon faculty for several years now, and a number of professors, especially in the sciences, have incorporated the technique into their courses. They hand out exam wrappers with graded exams, collect the wrappers once they are completed, and—cleverest of all—they hand back the wrappers at the time when students are preparing for the next test.

Does this practice make a difference? In 2013 Lovett published a study of exam wrappers as a chapter in the edited volume Using Reflection and Metacognition to Improve Student Learning. It reported that the metacognitive skills of students in classes that used exam wrappers increased more across the semester than those of students in courses that did not employ exam wrappers. In addition, an end-of-semester survey found that among students who were given exam wrappers, more than half cited specific changes they had made in their approach to learning and studying as a result of filling out the wrapper.

The practice of using exam wrappers is beginning to spread to other universities and to K–12 schools. Lorie Xikes teaches at Riverdale High School in Fort Myers, Fla., and has used exam wrappers in her AP Biology class. When she hands back graded tests, the exam wrapper includes such questions as:

Based on your responses to the questions above, name at least three things you will do differently in preparing for the next test. BE SPECIFIC.

“Students usually just want to know their grade, and that's it,” Xikes says. “Having them fill out the exam wrapper makes them stop and think about how they go about getting ready for a test and whether their approach is working for them or not.”

In addition to distributing exam wrappers, Xikes also devotes class time to going over the graded exam, question by question—feedback that helps students develop the crucial capacity of “metacognitive monitoring,” that is, keeping tabs on what they know and what they still need to learn. Research on retrieval practice shows that testing can identify specific gaps in students' knowledge, as well as puncture the general overconfidence to which students are susceptible—but only if prompt feedback is provided as a corrective.

Over time, repeated exposure to this testing-feedback loop can motivate students to develop the ability to monitor their own mental processes. Affluent students who receive a top-notch education may acquire this skill as a matter of course, but this capacity is often lacking among low-income students who attend struggling schools—holding out the hopeful possibility that retrieval practice could actually begin to close achievement gaps between the advantaged and the underprivileged.

This is just what James Pennebaker and Samuel Gosling, professors at the University of Texas at Austin, found when they instituted daily quizzes in the large psychology course they teach together. The quizzes were given online, using software that informed students whether they had responded correctly to a question immediately after they submitted an answer. The grades earned by the 901 students in the course featuring daily quizzes were, on average, about half a letter grade higher than those earned by a comparison group of 935 of Pennebaker and Gosling's previous students, who had experienced a more traditionally designed course covering the same material.

Astonishingly, students who took the daily quizzes in their psychology class also performed better in their other courses, during the semester they were enrolled in Pennebaker and Gosling's class and in the semesters that followed—suggesting that the frequent tests accompanied by feedback worked to improve their general skills of self-regulation. Most exciting to the professors, the daily quizzes led to a 50 percent reduction in the achievement gap, as measured by grades, among students of different social classes. “Repeated testing is a powerful practice that directly enhances learning and thinking skills, and it can be especially helpful to students who start off with a weaker academic background,” Gosling says.

Gosling and Pennebaker, who (along with U.T. graduate student Jason Ferrell) published their findings on the effects of daily quizzes in 2013 in the journal PLOS ONE, credited the “rapid, targeted, and structured feedback” that students received with boosting the effectiveness of repeated testing. And therein lies a dilemma for American public school students, who take an average of 10 standardized tests a year in grades three through eight, according to a recent study conducted by the Center for American Progress. Unlike the instructor-written tests given by the teachers and professors profiled here, standardized tests are usually sold to schools by commercial publishing companies. Scores on these tests often arrive weeks or even months after the test is taken. And to maintain the security of test items—and to use the items again on future tests—testing firms do not offer item-by-item feedback, only a rather uninformative numerical score.

There is yet another feature of standardized state tests that prevents them from being used more effectively as occasions for learning. The questions they ask are overwhelmingly of a superficial nature—which leads, almost inevitably, to superficial learning.

If the state tests currently in use in the U.S. were themselves assessed on the difficulty and depth of the questions they ask, almost all of them would flunk. That is the conclusion reached by Kun Yuan and Vi-Nhuan Le, both then behavioral scientists at RAND Corporation, a nonprofit think tank. In a report published in 2012, Yuan and Le evaluated the mathematics and English language arts tests offered by 17 states, rating each question on the tests on the cognitive challenge it poses to the test taker. The researchers used a tool called Webb's Depth of Knowledge—created by Norman Webb, a senior scientist at the Wisconsin Center for Education Research—which identifies four levels of mental rigor: DOK1 (simple recall), DOK2 (application of skills and concepts), DOK3 (reasoning and inference) and DOK4 (extended planning and investigation).

Most questions on the state tests Yuan and Le examined were at level DOK1 or DOK2. The authors used level DOK4 as their benchmark for questions that measure deeper learning, and by this standard the tests are failing utterly. Only 1 to 6 percent of students were assessed on deeper learning in reading through state tests, Yuan and Le report; 2 to 3 percent were assessed on deeper learning in writing; and 0 percent were assessed on deeper learning in mathematics. “What tests measure matters because what's on the tests tends to drive instruction,” observes Linda Darling-Hammond, emeritus professor at the Stanford Graduate School of Education and a national authority on learning and assessment. That is especially true, she notes, when rewards and punishments are attached to the outcomes of the tests, as is the case under the No Child Left Behind law and states' own “accountability” measures.

According to Darling-Hammond, the provisions of No Child Left Behind effectively forced states to employ inexpensive, multiple-choice tests that could be scored by machine—and it is all but impossible, she contends, for such tests to measure deep learning. But other kinds of tests could do so. Darling-Hammond wrote, with her Stanford colleague Frank Adamson, the 2014 book Beyond the Bubble Test, which describes a very different vision of assessment: tests that pose open-ended questions (the answers to which are evaluated by teachers, not machines); that call on students to develop and defend an argument; and that ask test takers to conduct a scientific experiment or construct a research report.

In the 1990s, Darling-Hammond points out, some American states had begun to administer such tests; that effort ended with the passage of No Child Left Behind. She acknowledges that the movement toward more sophisticated tests also stalled because of concerns about logistics and cost. Still, assessing students in this way is not a pie-in-the-sky fantasy: Other nations, such as England and Australia, are doing so already. “Their students are performing the work of real scientists and historians, while our students are filling in bubbles,” Darling-Hammond says. “It's pitiful.”

She does see some cause for optimism: A new generation of tests is being developed in the U.S. to assess how well students have met the Common Core State Standards, the set of academic benchmarks in literacy and math that have been adopted by 43 states. Two of these tests—Smarter Balanced and Partnership for Assessment of Readiness for College and Careers (PARCC)—show promise as tests of deep learning, says Darling-Hammond, pointing to a recent evaluation conducted by Joan Herman and Robert Linn, researchers at U.C.L.A.'s National Center for Research on Evaluation, Standards, and Student Testing (CRESST). Herman notes that both tests intend to emphasize questions at and above level 2 on Webb's Depth of Knowledge, with at least a third of a student's total possible score coming from questions at DOK3 and DOK4. “PARCC and Smarter Balanced may not go as far as we would have liked,” Herman conceded in a blog post last year, but “they are likely to produce a big step forward.”