Bad actors are a key threat to validity that cannot be controlled directly through better survey design. That is, unlike bias introduced by question wording or ordering, we cannot eliminate adversarial respondents through the survey design itself. What we can do is use the design to make these adversaries easier to identify.
Bots are computer programs that fill out surveys automatically. We assume that bots have a policy for choosing answers that is either completely independent of the question, or is based upon some positional preference.
No positional preference A bot that chooses responses randomly is an example of one that answers questions independently from their content.
Positional preference A bot that always chooses the first answer option, always chooses the last answer option, or alternates positions on the basis of the number of available choices: for example, “Christmas tree-ing” a multiple choice survey.
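The two bot classes above can be sketched as answer-selection policies. The function names and signatures below are illustrative assumptions, not part of any real system; each policy returns a 0-based option index given the number of available options.

```python
import random

def random_policy(n):
    """No positional preference: the choice is independent of question
    content, drawn uniformly from the n available options."""
    return random.randrange(n)

def first_option_policy(n):
    """Positional preference: always choose the first option."""
    return 0

def christmas_tree_policy(n, question_index):
    """Positional preference that varies by question, cycling through
    positions to produce a zig-zag ("Christmas tree") pattern."""
    return question_index % n
```

Note that the Christmas-tree policy, like the other two, never inspects the question or answer text; it depends only on position and option count.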
We define a lazy respondent as a human who behaves in a bot-like way. In the literature these individuals are called spammers; a study from 2010 found that almost 40% of the population sampled failed a screening task that required only basic reading comprehension. There are two key differences between human adversaries and software adversaries: (1) we hypothesize that individual human adversaries are less likely to choose responses randomly, and (2) that when human adversaries have a positional preference, they are more likely to make small variations in their otherwise consistent responses. Regarding (1), while many studies and much press have been devoted to humans’ inability to identify randomness, there has been some debate over whether humans can actually generate sequences of random numbers. Regarding (2), while a bot can be programmed to make small variations in positional preference, we believe that humans will make much more strategic deviations in their positional preferences.
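The hypothesized behavior in (2) can be sketched as a policy that mostly holds a preferred position but occasionally deviates. This is a minimal sketch under our stated assumptions; the function name and the deviation rate are illustrative, not measured behavior.

```python
import random

def lazy_human_policy(n, preferred=0, deviation_rate=0.1, rng=random):
    """Hypothetical lazy-respondent policy: usually pick a preferred
    position, but occasionally deviate to some other option, so the
    response pattern is consistent without being perfectly bot-like.
    `n` is the number of answer options for the current question."""
    if n > 1 and rng.random() < deviation_rate:
        # deviate: pick any position other than the preferred one
        others = [i for i in range(n) if i != preferred]
        return rng.choice(others)
    return preferred
```

With `deviation_rate=0` this collapses to the pure positional-preference bot above, which is what makes lazy respondents hard to distinguish from bots in the first place.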
Both humans and bots may have policies that depend on the surface text of a question and/or its answer options. An example of a policy that chooses answers on the basis of surface text might be one that prefers the lexicographically first option, or one that always chooses options whose text matches some criterion (e.g., contains “agree”). These adversaries are significantly stronger than the ones mentioned above.
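Both surface-text policies mentioned above can be sketched as follows; the function names are hypothetical, and each takes the list of option strings rather than just a count, which is what makes these adversaries stronger.

```python
def lexicographic_first_policy(options):
    """Choose the index of the lexicographically first option text."""
    return min(range(len(options)), key=lambda i: options[i])

def keyword_policy(options, keyword="agree"):
    """Choose the first option whose text contains a keyword
    (case-insensitive); fall back to the first option otherwise."""
    for i, text in enumerate(options):
        if keyword in text.lower():
            return i
    return 0
```

One subtlety worth noting: a naive keyword match on “agree” also matches “disagree,” so even this stronger adversary can be tripped up by answer wording.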
It’s possible that some could see directly modeling a set of adversaries as overkill; after all, services such as AMT rely on reputable respondents to attract users. While AMT provides means for requesters to filter the population, this system can easily be gamed. A tutorial from 2010 describes best practices for maximizing the quality of AMT jobs. Unfortunately, injecting “attention check” or gold standard questions is insufficient to ward off bad actors. Surveys are a prime target for bad actors because the basic assumption is that the person posting the survey doesn’t know what the underlying distribution of answers ought to look like — otherwise, why would they post a survey? Sara Kingsley recently pointed us to an article from All Things Considered, in which Emery found the following comment:
I’ve been doing Mechanical Turk jobs for about 4 months now.
I think the quality of the survey responses are correlated to the amount of money that the requester is paying. If the requester is paying very little, I will go as fast as I can through the survey making sure to pass their attention checks, so that I’m compensated fairly.
Conversely, if the requester wants to pay a fair wage, I will take my time and give a more thought out and non random response.
A key problem that the above quote illustrates is that modeling individual users is fruitless. MACE is a seemingly promising tool that uses post hoc generative models of annotator behavior to “learn whom to trust and when.” Notably, this work does not cite prior work by Panos Ipeirotis, which modeled users with EM and considered variability in workers’ annotations.
The problem with directly modeling individual users is that it cannot account for the myriad latent variables that lead a worker to behave badly. To do so, we would need to explicitly model every individual’s utility function. This function would incorporate not only the expected payment for the task, but also the worker’s subjective assessment of the ease of the task, the aesthetics of the task, and their judgment of the worthiness of the task. Not all workers behave consistently across tasks of the same type (e.g., annotation), let alone across tasks of differing types. Are workers who accept HITs that cause them dissatisfaction more likely to return the HIT, or to complete the minimum amount of work required to convince the requester to accept their work?