This is going to be a short post that I’ll expand on more, post-portfolio…

One of the nice features of AutoMan is that it manages the pricing of a task for you. The user only needs to specify a maximum amount they’re willing to pay, and AutoMan will return the result at the optimal price. It first computes the number of agreeing responses needed to have high confidence that the result is correct. Then it starts at an initial baseline assignment duration (i.e. the time expected to complete the task). The initial duration may be provided by the user; the default setting is 30 seconds. AutoMan uses this time to compute the wage, which is tied to the US federal minimum wage. The assignment is posted on Mechanical Turk for some lifetime set to one hundred times the task duration. If no results come back during that lifetime, it doubles the task time and reposts the job.

Of course, there’s one caveat : since AutoMan relies on there being a single answer, if the population is in disagreement, the budget will be used up and no results will be returned.

One of the original motivations for SurveyMan was to address the idea of returning distributions of results, rather than point estimates of results. We were also interested in computing end-to-end confidence intervals for chained AutoMan computations. We realized that the underlying structure of chained distributions of functions exactly modeled surveys. Thus, SurveyMan.

SurveyMan today is quite far from this original motivation. Since we began collaborating with social scientists, we veered into experimental design and discovery, rather than static analysis or speculative execution of programs having functions returning probabilistic results.

Determining the optimal price for a SurveyMan task is a feature from AutoMan that I’d like to see in SurveyMan. AutoMan’s automated pricing scheme was possible because the system could calculate the number of responses needed to determine if the answer had been found. This is significantly more challenging for SurveyMan. If we treat a survey as a the joined probability distribution of each of its questions, then determining the sample size of “good” respondents boils down to power analysis. However, the techniques here are somewhat different; power analysis was designed for either the case where the probability of certain conditions could be computed directly, or in *post hoc* data analysis, once the data had already been collected. A first-pass consideration of what we’re really looking for here is an online algorithm to determine the convergence of distribution of the joint probabilities of the questions. I don’t think this is a bad start, but I worry about a decision procedure that relies on such complex data. For example, if we consider a survey that’s flat, this means we treat each question as exchangeable. This does not however mean that the questions are independent. Suppose for a moment that they were; then we could say something like, the survey is a random variable defined to be the sum of the random variables representing the questions:

$$ S = Q_{1} + Q_{2} + … + Q_{n} $$

This survey has $$n$$ questions and each of the random variables $$Q_i, 1 \leq i \leq n$$ corresponds to the distribution of the answer texts — that is, it has no notion of its own position in the survey.

We could then decide that our stopping condition is the case where our expectation converges. Since expectation is linear, we can look at the convergence of each question’s distribution and make our decision then.

Okay, so if the questions were actually independent, I don’t see being such a bad approach. I guess if we assume that the underlying population can be represented as the sum of some unknown number of independent normal distributions, we can say that the mean is a sufficient statistic and call it a day.

Of course, we have little reason to believe that the questions are independent. While randomizing the order of the questions simplifies our identification of bugs, it’s quite different when we consider the convergence of distribution. We would need to consider each instance of the survey as a Bayesian network, where the previous questions are parents of the following questions. We already perform pairwise correlation tests; we could use some of this information to determine independence. If we could simplify the model sufficiently, we might be able to converge on an optimal number of samples. We could use the independence assumption to calculate a lower bound on the number of responses needed.

Anyway, this was meant to be a short post — the idea is to present some of the difficulties of automatically determining pricing in SurveyMan. The point is that the pricing mechanism itself depends on knowing how many “good” responses we need and answering that is hard. Even if we could answer that question (and we should — it’s certainly been on the minds of our colleagues in linguistics), we would then need to consider the effects of allowing breakoff, which further complicates things.

Emma ToschPost authorNote that determining your sample size/pricing requires real-time decision making and any future work in this area should draw on the literature from online algorithms.