How do you know when research is based on falsified data? That’s one of the challenges faced throughout statistical research. Noble Kuriakose and I authored a paper (forthcoming in the Statistical Journal of the IAOS) that outlines a technique to detect potential fraud in public opinion surveys. So far its merits have been debated on this blog, on Andrew Gelman’s blog, and in the pages of Science. The Pew Research Center has also published several rejoinders to our paper.
We wanted to develop a way to detect a type of potential survey fraud that we suspected often went undetected: what we call “near matches.” Near matches occur when the answers of many survey respondents match on nearly every question. Like exact duplicates, near matches are suspicious, for reasons I explain below.
Duplicates are suspicious. So are near-duplicates.
Here’s the basic intuition that inspired our method. Organizations that do international survey research perform checks on data quality, including removing duplicate responses, which are cases where the answers of one respondent match those of another respondent for every question. The reason is simple: Most major international survey projects have lengthy questionnaires, making it extremely unlikely that any two individuals will provide the exact same responses to all questions. Instead, a more likely cause of a pair of exact duplicates is that a firm accidentally entered responses from the same form twice, or, more perniciously, someone at the local firm hired to do the survey intentionally copied a valid interview to save time or money.
Yet the likelihood of two individuals answering 99 percent of questions in the same way is, at least in statistical terms, virtually no different from that of an exact match. Why then do researchers throw out exact matches but retain extremely near matches? It seemed to us that there were two basic reasons.
First, there was no readily available program to test for near matches, meaning they went undetected. To address this issue, we developed a new Stata program called percentmatch, which is publicly available and free to download. Percentmatch identifies how similar each observation is to its most similar neighbor. Second, there was no proposed standard for how near a match would have to be to indicate possible fraud. We wrote our paper to address this second issue.
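To make the idea concrete, here is a minimal Python sketch of the statistic described above: for each respondent, the highest share of questions answered identically to any other single respondent. This is an illustration only, not the percentmatch Stata program itself. The function name `max_percent_match`, the toy data, and the choice to treat every column as a question with no missing values are all assumptions made for the example.

```python
def max_percent_match(responses):
    """For each respondent (row), return the highest share of questions
    answered identically to any ONE other respondent.

    Assumes at least two rows, equal-length rows, and no missing values
    (a simplification for illustration).
    """
    rows = [list(r) for r in responses]
    scores = []
    for i, row in enumerate(rows):
        best = 0.0
        for j, other in enumerate(rows):
            if i == j:
                continue  # skip the comparison of a row with itself
            # Count the questions on which the two respondents agree.
            matches = sum(a == b for a, b in zip(row, other))
            best = max(best, matches / len(row))
        scores.append(best)
    return scores


# Made-up toy data: 4 respondents, 5 questions.
toy = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 4, 1],  # agrees with the first respondent on 4 of 5 questions
    [5, 4, 3, 2, 1],
    [2, 2, 2, 2, 2],
]
scores = max_percent_match(toy)
print(scores)
```

A score of 1.0 would mark an exact duplicate, and scores close to 1.0 would mark near matches. The sketch compares every pair of rows, so its running time grows with the square of the number of respondents, which is acceptable for a toy example but may be slow on large surveys.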