# Defining Equality

This post follows directly from my previous post on correctness. Rather than dive directly into the details of correctness for surveys, I wanted to spend some time thinking about some of the things we might want to prove, show or derive first.

We've stated that one way to define what we mean by survey correctness is to say that the same survey (modulo randomization) administered to the same person, under the same set of conditions, will yield the same set of results. Let's start by assuming that it is possible to repeat the survey for a respondent under the exact same conditions. Then there is is some underlying gold standard for the survey. While we cannot repeat the survey for the same set of respondents, under the exact same conditions in practice, we can emulate this scenario with our simulator. We choose a set of answers, according to a particular profile and let this be the gold standard. The simulated respondent then chooses a response according to the map from questions to answers. Once the survey is complete, we compare the results returned by the respondent with our generated gold standard. The question-answer sets for the gold standard and the returned values should always be equal.

At first, it may seem that this definition of equality is trivial. We have unique identifiers for each question and answer option that are created directly from their data's location in the input csv. The response form uses these ids. One question is displayed at a time and no questions ever repeat (we'll prove this later). We simply check the id of the question currently displayed, look up its answer in our response map, and select the option(s) that has/have matching identifier(s) to our gold standard's responses. So long as the HTML is correct, we will get back the same id, right?

Wrong. Well, a little wrong. If your survey does not use variants, you're fine. If it does, here's the problem: the variant is chosen at runtime i.e. when the respondent is actually executing the program (taking the survey). Since we cannot divine which question will be shown, we need a different way of deciding the appropriate answer.

### Solution 1: Specify the random choice at compile-time, rather than run-time.

This is certainly a compelling argument, and it solves our problem of not knowing what question will be displayed at runtime. When we began work on SurveyMan, we generated all of our surveys statically. All of the randomization was done in Java. We produced the full HTML to be displayed and sent this over as the payload. We didn't have much in the way of an interpreter; we just had some jQuery running in document.ready that hid all question divs and then showed the first question. We were not playing with breakoff at this point, so a next and a previous button were always displayed (except for the first and last questions). We simply swapped which divs were visible and which weren't.

Practical issues We found considerable issues with this approach, enumerated below:

• AMT limitations While this may change at any time, historically there haven't been many degrees of freedom when deciding assignments of workers to tasks on AMT. Since our primary source of respondents was AMT, we had to keep this in mind. Since each survey was static HTML, we had to send each over in its own HIT. In the past, this meant that we would post some number of HITs at a time. The easiest way to do this was to post a survey, wait for a response, validate that response, issue "reverse" qualifications, and then post some more. The "reverse" qualification thing is an Automan technique to disqualify people who had already responded to that HIT under a different randomization.

There are some logistical issues with this approach. Posting one HIT that has one assignment at a time means that we incur significant latency in our response period. We cannot take advantage of the massive parallelism of the crowd. If we instead post the HITs in batches, we have to hedge against workers who will answer some number of HITs in a row before the reverse qualification kicks in. We end up paying for redundant collection, as well as incurring additional latency (due to the workers who completed the HIT more than once). Repeaters are a major threat to this approach -- AMT groups HITs in such a way that similar HITs appear together. The researcher can get around this automated grouping by creating a new group id and putting each HIT in its own group. Now we are back to the original latency problem, because we have one HIT wonders floating around in the MTURK ether.

Note that the AMT system is designed as follows: Requesters may either specify one HIT that has many assignments (redundancy), or they may specify many HITs that will be grouped together with the express purpose of making it easier for one person to perform many HITs. We would like to be able to take advantage of the former, and posting static HTML does not allow this.

• Page Load Time The prototypicality survey had almost 100 questions. There was an obvious lag in loading on several browsers; the contents flashed before hiding the appropriate divs. We could have denoted these divs as hidden, but we would still have the lag of loading in a large file. The problem was even worse when we loaded in media content. It was clear from this alone that some of the content would have to be loaded on an as-needed basis.
• Payload size AMT has a maximum payload size. I can look it up, but needless to say, we hit it with one of the linguistics surveys. I believe that survey had approximately 200 questions -- they were similar in nature to the phonology questions. The true payload would have been smaller, since that survey was designed to be sampled from. However, there will be cases where survey writers are going to want to simulate an infinite stream of survey questions. We would like to be able to send over as large a survey as possible, and this is just not going to happen if we send over a static webpage. Now, one could argue that the cost of the payload was really in the redundant HTML we were sending over -- we could easily have swapped out the question and option text, while still specifying the full set of questions to be seen a priori. This is a fairly compelling argument that adequately addresses both the page load time, as well as the payload size.
• Opportunities for exploitation It's very easy for someone to write a script that, for instance, selects every radio button answer and hits submit. In fact, if we have a submit button sitting around anywhere, we have a problem. We could dynamically generate all buttons as needed, while still sending over the statically chosen questions. The solution at this point would be to send over minimal HTML and instead swap in the data that we need, as needed. All buttons and control flow through the survey would be controlled by a small state machine. This would prevent respondents from writing a script to click on all input of a given type and submit. It's certainly still possible to write a "bot" to submit a form when inputs are generated dynamically, but it will now require the script to simulate the survey, which is more effort. Furthermore, the user has no guarantee that the script will terminate (although we know it will).
• Opportunities to "cheat." Besides automated "cheating," it's also possible for users to see the set of questions being asked, simply by looking at the source code. If we store the questions and option text in a lookup table, they become significantly more difficult for a human to interpret -- while they can still see what the questions are asking, human adversaries will not have the full picture. Even though there is still room for adversarial action, we hope that the increased indirection of dynamically generating the questions increases the complexity of the problem and deters adversaries from gaming the system.

The above arguments make a strong case for generating the survey dynamically. The first point is the strongest practical reason for performing randomization at runtime; for the latter points, randomizing at runtime gives us a nice defense against some adversarial behavior, but it isn't particularly compelling.

A major advantage to randomizing at runtime is that it is fundamentally more principled than randomizing a priori. Our randomization mechanism can also be framed as a an issue of nondeterministic choice -- in the case of variants, we have choose one of a set of these variants. In the case of the total set of questions, it is one of the set of all possible permutations. In principle, we can say that the behavior of the machine at runtime is the same for all permutations of the survey, since the machine itself is selecting which permutation. All equivalent surveys are statically equivalent, with only the runtime behavior differing. This will make certain properties easier to prove later on.

### Solution 2: Mark every question in a variant block with equivalent answers

Recall that the problem we're trying to solve is how to define equivalence between questions. The idea behind fixing the random order at compile time was so that our two parallel universe survey takers would end up with the exact same answers. Even if you don't buy the argument to randomize at runtime, rather than randomizing at compile-time, there's still the problem of comparing across different respondents who see different variants.

A new pass solution might be to answer every question in a variant's block. However, if we do this, we will have answers for every question in that block sitting around in our answer set. The resulting set of question-answer pairs will not be equal to the gold standard set.

We could try to get around this shortcoming by removing the questions that weren't seen. This doesn't quite work either. One of the things we're trying to show is that, not only are the answers consistent (this is the part of the correctness argument that addresses bugs that are unique to surveys), but also that the same set of questions are seen (this is the part of the correctness argument that will address the more traditional programming languages issues in our design). If we just remove the other questions, we run into a dangerous situation where we are adding and removing data in the course of our correctness argument, which will make some things harder to prove. Furthermore, we will need to wait until runtime to decide whether two answer sets are equivalent.

### Solution 3: Define a set of equality relations for our objects of interest

Thus far our solutions have been computational in nature -- their correctness is defined by their implementation. We'd really like a more general solution that will allow us to reason about a survey program from first principles. We would also like a solution that can be applied to a broader swatch of implementations. We'd like our definitions to be the gold standard of correctness for future survey language writers. Let's start with some definitions.

A survey is a set of blocks. A block is a heterogenous set of questions and blocks. A question is a record with a survey-scoped unique identifier and a (possibly empty) set of answer options. An option is a record with a survey-scoped unique identifier. These all follow from our definition of the syntax.

Answer Set An answer set is a set of 3-tuples of question identifiers, option identifiers, and indices.

Two sets of answers are "equal" (denoted as $\equiv_{resp}$, called "response-equivalent") if and only if two conditions are satisfied:

1. There exists a bijective mapping between two answer sets.
2. The tuples in the mapping satisfy an equivalence relation, which we will define.
##### Begin Algebra Tangent

I asked Dan Stubbs if there was an algebraic structure where these properties would make sense. He suggested that perhaps I was defining a quotient group. After a nap and some perusing of Wikipedia (and a dense Algebra text), I'm not entirely sure this is what I want. Here are my unstructured thoughts on this:

• The answer set is a subset of all possible answers. All possible answers for each question type are defined in the language as follows:
• Exclusive (Radio Button) questions The set of individual response options.
• Non-exclusive (Checkbox) questions The power set of response options minus one (you cannot have the empty set -- that is, the respondent must check at least one box).
• Regex questions Any string satisfying the regex, up to the size allowable in the response field (would have to look this up).
• Freetext questions The subset of all strings up to the size allowable in the response field.
• Since we have a clear mapping between questions and answers, we can talk about questions instead. Really, when we say that the response set is invariant under randomization, we're saying that the question-answer pairs should be invariant under randomization.
• According to Wikipedia

In mathematics, specifically group theory, a quotient group (or factor group) is a group obtained by aggregating similar elements of a larger group using an equivalence relation that preserves the group structure.

What's the larger group here? Presumably it would be the set of all possible answers. This doesn't seem right, thought -- we don't want the set of all answers, because this would mean that a valid response could include answering, say, two questions from the same variant block. Instead, I think it ought to be the set of unique answer sets. We're not worrying about breakoff for now, so we don't need to define this as the power set of some set of answers. If we have $n$ questions, each with $m$ radio-button (exclusive) options and no variants and no breakoff, then we have $m^n$ possible answer sets. To account for variants, we multiply by the cardinality of each variant set.

What are the smaller groups? They're the set of question-answer pairs we've collapsed. Now what's the group structure? To be honest, I'm not entirely sure an answer set satisfies the definition of a group. For starters, what's our group operator? I suppose a tempting first choice might be a set addition operator. We could say that if $\lbrace (q_i, a_i, i) \rbrace$ and $\lbrace (q_j, a_j, j) \rbrace$ are each in the set of possible question-answer pairs, then $\lbrace (q_i, a_i, i), (q_j, a_j, j)\rbrace$ is also in the set of possible question-answer pairs. Of course, for $\lbrace (q_i, a_i, i) \rbrace$ and $\lbrace (q_j, a_j, j) \rbrace$ to be floating around, we would need to allow breakoff. Now, what would happen if we tried to combine two complete answer sets? If the answer sets are equivalent, we get back the same answer set. If the answer sets are equivalent, modulo variant questions, how do we decide what's returned? If the answer sets are radically different, then what?

I think at this point, we need to be talking about questions, rather than answer sets. Two answer sets can differ either because the questions seen are different (due to breakoff or variants) or because the answers chosen are different, or both. If we only consider questions, we can define a question-combining operator (rather than a response-combining operator) and get the properties we want. We would need to define some unified variable to represent the set of variants, so any time a variant is combined, we represent the variant with that variable.

And this is about where I'm stuck. It's not clear to me that this kind of algebraic structure is what we want. Wait -- why would we want it in the first place? Well, I like challenging myself to think about these problems from a variety of angles, so it's intellectually satisfying as an exercise in its own right. But it could also prove fruitful later down the line. It might be the case that by having these algebraic properties, we get some proofs of correctness or other properties for free. Reducing a known and well-studied problem to a new domain for the win!

##### End Algebra Tangent

So basically, we need to define this bijective mapping. Here's the short version (since this post has been sitting, incomplete and open in a tab for well over a month):

Two questions are eqivalent if either:

• The questions are identical. That is, they have the same ids (and should belong to the same block, have the same flags, etc. Questions are immutable entries in a spreadsheet, so you could only get two questions with equal ids but unequal other stuff if your lexer/parser/compiler screwed something up).
• The questions are both contained in a block having type ALL. In this case, the number of options will be equivalent. The order of their entry denotes equivalence. That is, the $i^{th}$ option for each question variant is assumed to be equal in its meaning.

Now here's the punchline: since we use different paths through the blocks (and thus different containing block ids) to denote an experimental control (really, a "control path") and to denote experimental designs such as Latin squares, we currently have no way in either our static nor our dynamic analyses to understand equivalence across these paths. Although we can express experimental surveys using SurveyMan, our system currently cannot analyse these surveys as experiments.

# Metrics to answer our research questions

We had a number of excellent questions and comments from our reviewers. I've been working on a few entries to address them.

One of our reviewers pointed out that we didn't provide metrics for our research questions. Another pointed out that our results were qualitative, but that this was okay because we provided case studies, rather than user studies. Since we'd like to be able to do user studies in the future (hopefully the exposure we'll get at OOPSLA will lead to more collaborations!), it's certainly worth thinking about.

### Research Question 1: Is SurveyMan usable by survey authors and sufficiently expressive to describe their surveys?

Usable by survey authors? We think so. Certainly our colleagues in linguistics didn't have any problems with the CSV structure and the PL-ness of it. We were warned a bit more by our colleagues in econ. The argument I've heard is that (a) people are accustomed to flashy interfaces and would put off by using something text-based and (2) the formatting of the data, both the input and output, is counter-intuitive. I'm not entirely surethis is true. First of all, people have been using SAS and SPSS for years. When I took my first stats course in Spring 2004, we used one of those tools (too long ago to remember which). I remember the painstaking data entry involved. The summaries spit out, full of text tables (no graphics) weren't the easiest to interpret. This was the standard tool for statistical analyses in the social sciences at the time (I was an econ major then).

This project has been a challenge \emph{because} less-powerful, flashy alternative exist. Our colleagues in linguistics liked the csv idea because they were already storing their data in csvs (since they're easy to load into R and since they can -- and already were -- programmatically generate the data). It's unclear whether people in fields that are used to generating these things by hand will be as receptive. There's no reason I can see why someone wouldn't be able to write in our csv language, but drag and drop doesn't require reading a spec, and it's what most commercial services are using.

While I think the language is usable as-is, it would be more usable if we had a spreadsheet plugin to statically check the thing. How would we measure the usability of the tool, empirically?

I think one way of evaluating its usability would be to gather a bunch of surveys and recruit people for a user study. A traditional, offline version of this might go as follows:

1. Generate a list of available survey tools.
2. Recruit a representative sample of participants across the social sciences (probably other graduate students at UMass).
3. Administer a survey to those graduate students to learn which survey technologies they'd used before, whether they were familiar with any programming languages, and if they have any experience with spreadsheets and/or databases.
4. Ask the participants to take a survey specification and design a series of surveys with various features on one of the competing survey software packages (that they had not used before) and have them do the same tasks with SurveyMan.
5. Perform a post-treatment survey to ascertain results.

We would measure the time it took to complete the task, the number of clarifying questions the participants ask, and the perceived ease of the task. It might be a good idea to have the post-study follow-up survey about a week later.

Some design features we might want to consider are whether we assign the SurveyMan task first, how much prior experience in either programming languages or survey software served as prior training, and how we present the survey tasks. If we give the participants csvs of questions and just ask them to manipulate the data, we will be biasing them to a more favorable view of SurveyMan. We might instead give them a variety of inputs, such as a csv, a pdf, an English description of a survey we might want to run, and a followup survey based on that English description.

Expressive enough to describe the target population's surveys? Yes. We can certainly express all of the surveys that our clients wanted to be able to express, and in some cases we were able to express more. One interesting aspect of our collaborations is that we've discovered that some things we've been calling surveys are actually experiments. While we can describe them using our language, we currently lack the support in our runtime system to perform the kinds of analyses we'd like to see. This has caused us to pursue work in a new direction.

I think the best way to address this question would be with a list of features describing what it means to be a survey. I've been thinking about the difference between experiments and surveys a lot recently and a checklist like this would be illuminating.

### Research Question 2: Is SurveyMan able to identify survey errors?

We think so. We use standard statistical methods to identify wording bias and question order bias.

For breakoff, we report the distribution of breakoff over questions and positions. Diagnosing whether we have an unusual amount of breakoff requires a prior over comparable surveys, so we don't attempt to define a threshold for whether the amount of breakoff is statistically significant.

We could model a small amount of noise that says, "if this survey is well-formed, every respondent will have equal probability of breaking off at any point before the final question." We could say that there is some small $\epsilon$ representing this probability, which is independent of any features of the survey. Let $r$ represent our sample size. Then the expected number of respondents who broke off from this survey would be $r\epsilon$. If the total number of respondents who broke off before the final question is unusually high, then the survey author would need to debug the survey to figure out why so many respondents were leaving early.

We're clearly be interested in where respondents were breaking off, and not just that they were breaking off. It's possible that the survey is uniformly poorly designed; it's also possible that it's too long or that there is some point in the survey (such as a transition from one block to another) that causes an unusual amount of breakoff. For a flat survey having $n$ questions, every position is identical insofar as the set of questions that may be seen there. In such a case, we might want to model breakoff as a series of Bernoulli trials -- at each point, a respondent may break off with probability $\frac{\epsilon}{n-1}$. We would then expect there to be $\frac{\epsilon}{n - 1} \times r$ people breaking off at each position.

The problem with this model is that although the probability of breaking off at any given point is the same, the number of participants is not \emph{because a single person cannot break off twice}. First, let's model this as a recurrence. Let $B_i$ be the random variable denoting expected breakoff at position $i$. Then we get the recurrence:

$\mathbb{E}(B_1) = \frac{r\epsilon}{n-1}$
$\mathbb{E}(B_{i

Let's see if we can do better substituting in values:

$\mathbb{E}(B_2) = \frac{\epsilon}{n-1}\biggl(r - \frac{r \epsilon}{n-1}\biggr) = \frac{r\epsilon}{n-1}\biggl(1 - \frac{\epsilon}{n-1}\biggr)$
$\mathbb{E}(B_3) = \frac{\epsilon}{n-1}\biggl(r - \bigl(\frac{r\epsilon}{n-1}\bigl(1 - \frac{\epsilon}{n-1}\bigr)\biggr) = \frac{r\epsilon}{n-1}\biggl(1 - \frac{\epsilon}{n-1} + \bigl(\frac{\epsilon}{n-1}\bigr)^2\biggr)$
$\mathbb{E}(B_4) = \frac{\epsilon}{n-1}\biggl(r - \frac{r\epsilon}{n-1}\bigl(1 - \frac{\epsilon}{n-1} + \bigl(\frac{\epsilon}{n-1}\bigr)^2\bigr)\biggr) = \frac{r\epsilon}{n-1}\biggl(1 - \frac{\epsilon}{n-1} + \bigl(\frac{\epsilon}{n-1}\bigl)^2 - \bigl(\frac{\epsilon}{n-1}\bigr)^3\biggr)$
$\mathbb{E}(B_{i

Recall the closed-form formula for a geometric series, shamelessly ripped off Wikipedia:

$\sum_{k=0}^{n-1}ar^k = a\frac{1-r^n}{1-r}$.

Since we're overloading variable names, let's be clear about what represents what: $a_{wiki} \equiv \frac{r\epsilon}{n-1}$, $r_{wiki}\equiv \frac{-\epsilon}{n-1}$, $n_{wiki}\equiv i$. That means our closed form for $\mathbb{E}(B_{i is $\frac{r\epsilon}{n-1}\times\frac{1 - \bigl(\frac{-\epsilon}{n-1}\bigr)^i}{1 - \frac{-\epsilon}{n-1}}$, which we can make prettier as $\frac{r\epsilon}{n-1}\times\biggl(1 - \bigl(\frac{-\epsilon}{n-1}\bigr)^i\biggr)\times\frac{n-1}{n - 1 + \epsilon}$ and reduce to a still hideous $\frac{r\epsilon}{n-1+\epsilon}\biggl(1 - \bigl(\frac{-\epsilon}{n-1}\bigr)^i\biggr)$. No lie, I remember it being less formidable looking in my notes, but they're at my desk, and I'm pretty sloppy with the algebraic manipulation, so I welcome any spot-checking.

So what does this mean? We have a formula for the expected number of respondents who break off at each index. Now we can at least use the Markov inequality to flag really egregious positional breakoff. Since the Markov inequality is very coarse, I have little doubt that researchers would be able to spot the cases it would find. That is, Markov doesn't give us much value for small data sets (note that it's still useful as a feature in an automated tool).

In order to get tighter bounds, we are going to need to know more about the distribution. We can use Chebyshev's inequality if we know the variance. We can get a Chernoff-like bound by substituting into Chebyshev's Inequality the expectation for the moment generating function, if we have it.

I think this process is similar to a Martingale, but I will have to spend more time thinking about this. Clearly the total expected breakoff is the same. The number of participants in the survey is expected to decay. It would be great to get some tight bounds on this. Once we can characterize breakoff for flat surveys, we have a prior on the role position plays and can use this information for diagnosing breakoff in more structured surveys. I haven't considered an equivalent analysis for question-based breakoff yet.

For now, in the analyses, we just stick with reporting the top locations and questions for abandonment/breakoff.

### Research Question 3: Is SURVEYMAN able to identify random or inattentive respondents?

I am working on a more in-depth analysis to help answer this question. It's going to need a whole blog post to itself. We have some gold-standard data from Joe and company for the phonology survey, and there are some heuristics I can use for the prototypicality survey. Graphs forthcoming!

# SocialSci

There's this website I came across while investigating what other people are doing for online experiments. It's called SocialSci.com. They have a platform for writing experimental surveys. They also boast of a well-curated participant pool. Here's the claim on their front page:

We take a three-tiered approach to our participant pool. We first authenticate users to make sure they are human and not creating multiple accounts. We then send them through our vetting process, which ensures that our participants are honest by tracking every demographic question they answer across studies. If a participant claims to be 18-years-old one week and 55-years-old the next, our platform will notify you and deliver another quality participant free-of-charge. Finally, we compensate participants via a secure online transaction where personally identifiable information is never revealed.

Ummmmm, okay. That's not game-able at all.

If you were thinking about just using their survey design tools and instead posting your instrument on AMT, be warned! They point out that AMT is full of scammers! I'm not sure this is true anymore; my own research has lead me to believe that traditional survey issues such as fatigue and attention issues are a bigger threat to validity. In any case, I looked up quotes for using their participant pool. The costs for 150 respondents (assured to be valid, via some common sense heuristics and their more complete knowledge of the participant pool) and no restrictions on the demographics (e.g. can include non-native English speakers) are:

My estimated time (in minutes)Price (in USD)
5300
10300
15300
20300
25300
45300
55450
60450

Even with bad actors, I'm pretty sure AMT is cheaper. The results were the same when I submitted a request for country of origin==USA. There were also options for UK and Australia, but these are not yet available (buttons disabled). I'll leave any analysis of the pay to Sara.

Their filters include Country, Language, Age, Sex, Gender, Sexual Orientation, Relationship Status, Ethnicity, Income, Employment Status, Education, Occupation, Lifestyle, and Ailments. What I find potentially useful for survey writers here is the opportunity to target low-frequency groups. There's our now infamous NPR story on how teens respond mischievously to survey questions. The set of responses from these teens have a high number of low-frequency responses, which amplifies false correlations when researchers analyze low-frequency populations. A service like SocialSci could provide useful, curated pools that rely less on one-time self reporting. However, it doesn't look like they're there yet, since there are the options available (shown, but currently disabled options are listed in brackets):

Country: USA, [UK], [Australia] Language: English, [Spanish] Age: 13-17, 18-40, 41-59, [51+], [60+] Sex: Male, Female, [Transgender], [Intersex] Gender: Cis Male, Cis Female, [Trans Male], [Trans Female] Sexual Orientation: Heterosexual, Homosexual, BiSexual (sic), Other
Relationship Status: Married, Single, Co-Habitating, Dating
Ethnicity: Caucasian, [Asian], [Hispanic], [Black], [Native American], [Multiracial] Income: less than 25K, 25K-50K, 50K-75K, 75K-100K, [100K-125K], [125K or more] Employment Status: Full Time, Unemployed, [Part Time], [Temporary Employee Or Independent Contractor] Education: Some College, Associate's Degree, High School Diploma, Bachelors Degree, Some High School, [Masters Degree], [Doctoral Degree] Occupation: Student, Professional, Technical, Teacher, Sales, Corporate
Lifestyle: Smoke, Used to Smoke, Have Children, Have Cell Phone