Companies and individuals are often at odds, concerned either with collecting information or with preserving privacy. Online stores and services are always eager to know more about their customers—income, age, tastes—whereas most of us are not eager to reveal much.
Math suggests a way out of this bind. A few years ago Rakesh Agrawal and Ramakrishnan Srikant, both data-mining researchers, developed an idea that makes telling the truth less worrisome. The idea works if companies are content with accurate aggregate data and not details about individuals. Here is how it goes: you provide the numerical answer to certain intrusive online questions, but a random number is added to (or subtracted from) it, and only the sum (or difference) is submitted to the company. The statistics needed to recover approximate averages from the submitted numbers are not that difficult, and your privacy is preserved.
Thus, say you are 39 and are asked your age. The number sent to the site might be anywhere in the range of 19 to 59, depending on a random number between –20 and +20 that is generated (by the company if you trust it, by an independent site or by you). Similar fudge factors would apply to incomes, zip codes, years of schooling, size of family, and so on, with appropriate ranges for the generated random number.
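The age example can be tried out in a few lines of Python. This is only a minimal sketch, not the researchers' actual procedure: the function name `mask`, the made-up population of ages between 18 and 80, and the choice of whole-number offsets drawn uniformly from –20 to +20 are all assumptions for illustration. It shows the simplest case, in which the offsets average out to zero, so the mean of the submitted numbers lands close to the true mean.

```python
import random

def mask(value, spread, rng):
    """Add a uniform random integer offset in [-spread, +spread] to value."""
    return value + rng.randint(-spread, spread)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population (an assumption): 100,000 true ages from 18 to 80.
true_ages = [rng.randint(18, 80) for _ in range(100_000)]

# What the site receives: each age shifted by up to 20 years either way.
submitted = [mask(age, 20, rng) for age in true_ages]

true_mean = sum(true_ages) / len(true_ages)
estimated_mean = sum(submitted) / len(submitted)

print(f"true mean age:       {true_mean:.2f}")
print(f"mean of submissions: {estimated_mean:.2f}")
```

No single submitted number reveals the respondent's age to within better than the 41-year window, yet with many respondents the two means should differ by only a small fraction of a year.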
Another, older example from probability theory illustrates a variant of the idea. Imagine you are on an organization’s Web site, and the organization wishes to find out how many of its subscribers have ever X-ed, with X being something embarrassing or illegal. Not surprisingly, many people will lie if they answer the question at all. Once again, random masking comes to the rescue. The site asks the question, “Have you ever X-ed? Yes or no,” but requests that before answering it, you privately flip a coin. If the coin lands heads, the site requests that you simply answer yes. If the coin lands tails, you are instructed to answer truthfully. Because a yes response might indicate only a coin’s landing heads, people presumably would have little reason to lie.
The math needed to recover an approximation of the percentage of respondents who have X-ed is easy. To illustrate: if 545 of 1,000 responses are yes, we would know that about 500 of these yesses were the result of the coin’s landing heads because roughly half of all coin flips would, by chance, be heads. Of the other approximately 500 people whose coin landed tails, about 45 of them also answered yes. We conclude that because 45 or so of the approximately 500 who answered truthfully have X-ed, the percentage of X-ers is about 45/500, or 9 percent.
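The arithmetic above can be written as a short Python sketch. The function names `estimate_rate` and `simulate_yes_count` are hypothetical, and the code assumes a fair coin and respondents who follow the instructions honestly. The first function reproduces the 545-out-of-1,000 calculation; the second simulates a larger survey to check that the estimate tracks the true rate.

```python
import random

def estimate_rate(yes_count, total, p_heads=0.5):
    """Estimate the share of X-ers from the number of yes responses."""
    expected_forced_yes = total * p_heads       # yesses caused by heads
    expected_truthful = total * (1 - p_heads)   # people who answered truthfully
    return (yes_count - expected_forced_yes) / expected_truthful

def simulate_yes_count(true_rate, total, rng):
    """Simulate the survey: heads forces a yes, tails gives the true answer."""
    yes = 0
    for _ in range(total):
        heads = rng.random() < 0.5
        has_x_ed = rng.random() < true_rate
        if heads or has_x_ed:
            yes += 1
    return yes

# The article's numbers: 545 yesses out of 1,000 responses.
article_estimate = estimate_rate(545, 1000)
print(f"estimate from the article's figures: {article_estimate:.2%}")

# A simulated survey of 100,000 people with an assumed true rate of 9 percent.
rng = random.Random(0)  # fixed seed so the run is repeatable
simulated_yes = simulate_yes_count(0.09, 100_000, rng)
simulated_estimate = estimate_rate(simulated_yes, 100_000)
print(f"estimate from the simulated survey:  {simulated_estimate:.2%}")
```

Because the estimate subtracts the expected number of heads rather than the actual number, it is only approximate, and small samples can even produce a slightly negative rate.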
In some situations, variants of this low-tech technique, in conjunction with appropriate legislation, would work—or so thinks this 6′9″ X-er.
This article was originally published with the title No X-aggeration.
7 Comments
All of these methods seem to require that individuals trust companies to use these methods to prevent the collection of individualized personal data. If companies were trustworthy, they would simply not collect individualized personal information! Trust me!
The Failure Of Averages.
Here is why that idea of adding or subtracting random numbers to protect individual privacy fails.
The value of collected data about individuals is rarely in the average of any metric.
For example, what marketers in particular (and also social scientists) value most is correlation. How is one like or activity associated with another like or activity? Your proposed method of adding random variations to answers within a survey completely removes these correlations.
Your solution also removes the ability to determine trends and cause-and-effect relationships, which are often of even more economic value. (Who can’t make money off of predicting the future?)
(I note in passing that correlation and cause-and-effect are regularly confused by the media.)
As another example, there is often value in the distribution of a metric, not in its average. For example, a high school may have a “bi-modal” distribution of student scores due to having two distinct populations. Both the students and the school behave substantially differently in this case compared to a Gaussian distribution.
For example, the extensive use of honors classes may be liked in a bi-modal school whereas this educational approach may be considered “elitist” in a normal school. Also, student assault rates can vary significantly based on the shape of the distribution, perhaps due to the ability of students from each population cluster to find a comfortable peer group, rather than primarily competitive pressure.
As another example, consider housing markets. Different segments of a housing market, even in one geographic area, can have completely different characteristics. Entry-level houses may be lying vacant while high-end homes are being snatched up by eager buyers. There are often multiple striations within a housing market that averages, including the oft-quoted “median price” of homes, completely hide.
A third failure has to do with “outliers.” One application is to identify accurately the extremes of a distribution, such as the lowest or highest 2%. Adding random numbers to answers changes both the position and density of these outliers. Another use of outliers is to throw away their results due to assumptions about errors in recording, whether intentional or accidental.
Thus, except for a very small number of surveys—typically those already known to have a Gaussian distribution—this proposed solution lacks economic value.
I don't see how companies wanting information is sufficient for them being untrustworthy. The response from cow_duo is a very good reason why this statistical trick isn't necessarily going to affect stores, the example given, but it isn't due to some shadowy cabal just turning down the idea because their plans to steal everyone's information don't work well with it. The dream for a store is to direct-market, to get word out about a product to those who are the most likely to buy it. Put simply, that provides the biggest return for the investment. If you buy something through Amazon, they will then keep a record of your purchases, and your perusals, and based on that list suggest other purchases you might like. If the company takes more information, such as age or geographic location, they may start trying to data mine to find people who are "like" you and suggest that they may enjoy these items.
If you have evidence that they're doing something, that's fine, but otherwise it's just unfounded paranoia.
TRUST NO ONE! I would lie no matter what, even if the damn coin landed on its edge. Consider that your transmission could be traced to you, then it is on record that in response to the question you have affirmed yourself an X-er and you have no way of proving the coin came up heads. Hell, how do you prove a coin toss was ever required at all, once the original instructions have been deleted?
When sellers have more information than buyers, then the market is rigged in favor of the sellers. The sellers have no real need to collect personally identifying information to gain this advantage. The only gain from collecting such personally identifying information is the ability to do targeted marketing. Actual sellers (market participants) have no incentives to steal identities, though the existence of databases with such information makes a tasty treat for criminals (non-market participants) to steal. The ability of sellers to protect such databases is demonstrably poor.
I live by 2 basic philosophical rules.
1) Trust everyone but only with what I am willing to lose.
2) A fair fight is the one that I win and suffer minimal losses from while obliterating my opponent so that they will never be capable of being a threat again. There is no other criteria for a fight.
This can be applied to everything including sales and marketing and from buyer or seller perspectives.
"A fair fight is the one that I win and suffer minimal losses from while obliterating my opponent so that they will never be capable of being a threat again. There is no other criteria for a fight."
Your concept is neither fair, nor moral. But it is a fair representation of the attitudes and practices of the modern MBAs and corporate managers. An attitude that has destroyed our economy and is working to destroy the fabric of the nation.