On Keeping the Survey a DAG

A topic that came up during my SurveyMan lab talk in October was our lack of support for looping questions. Yuriy had raised the objection that there will be cases where we will want to repeat a question, such as providing information on employment. We argued that, since we were emulating paper surveys (at the time), the user could provide an upper bound on the number of entries and ask the user whether they wanted to add another entry for a category. A concern I had was that, since we’re interested in role of survey length in the quality of responses, and since we allow breakoff, when we have a loop in a question, it becomes much more difficult to tell whether the question is a problem or if the length of the survey is a problem. Where previously we treated each question as a random variable, we would now need to model a repeating question as an unknown sum of random variables.

The probability model of a survey with a loop differs from the model of a survey without one.

The probability model of a survey with a loop differs from the model of a survey without one. Note that while both random variables corresponding to the responses to question Q2 may be modeled by the same distribution, they will have different parameters.

This issue came up again during the OBT talk. The expanded version of Topsl that appeared in the PLT Redex book described a semantics for a survey that was allowed to have these kinds of repeated questions.

We do not think it is appropriate to model such questions as loops. Loops are fundamentally necessary to express computable functions. Since the kinds of questions these loops are modeling are more accurately described as having finite, unknown length, we do not want to encode the ability to loop forever.

Aside from this semantic difference, we see another problem with the potentially perpetual loop. Consider the use-case for such a question: in the case of the lab talk, it was Yuriy’s suggestion that we allow people to enter an employment history of unknown length. In the case of Topsl, it was self-reporting relationship history. If a respondent’s employment or relationship history is very long, they may be tempted to under-report the number of instances. This might be curtailed if the respondent is required to first answer* a question that asks for the number of jobs or relationships they** have had. Then responses in the loop could be correlated with the previous question, or the length of the loop could be bounded. In our setting, where we do not respondents to skip questions, the former would need to be implemented if we were to allow loops at all.

Alternatively, instead of presenting each response to what is semantically the same question as if it were a separate question, we could first ask the question for the number of jobs or relationships, and then ask a followup question on a page that takes the response to the previous question, and displays that number of text boxes on the page. We would still bound the total number of responses, but instead of presenting each question separately, we would present them as a single question.

In the analysis of a survey we ran, we found statistically significant breakoff at the freetext question. We’d like to test whether freetext questions in general are correlated with high breakoff. If this is the case, we believe it provides further evidence that the approach to “loop questions” is better implemented using our approach.

* I just wanted to note that I love splitting infinitives.
** While I’m at it, I also support gender-neutral pronouns. Political grammar FTW!

Leave a Reply

Your email address will not be published. Required fields are marked *