
Reading Rainbow

It’s only a few short weeks until PLDI 2014. Oh, the tedious and expensive travel! Just kidding (well, not really — it will involve quite a few trains and many, many dollars).

Inspired by Alex Passos’s yearly NIPS reading list, I’m going to throw together one of my own. Rather than listing abstracts, I’ll just post an ordered list of the papers I plan to read, and write about individual papers as I see fit.

Tier 1 : Authors I know

Unless the conference is massively multi-tracked, I find it a bit tactless to have to ask someone I’ve actually met and spoken with IRL whether they have a paper at the conference. This isn’t to say I haven’t done it, or that I’ve done so in a completely shameless way. I do, however, recognize that refraining from such behavior is A Good Thing.

  1. Doppio: Breaking the Browser Language Barrier
    John Vilk, University of Massachusetts, Amherst; Emery Berger, University of Massachusetts, Amherst.
  2. Expressing and Verifying Probabilistic Assertions
    Adrian Sampson, University of Washington; Pavel Panchekha, University of Washington; Todd Mytkowicz, Microsoft Research; Kathryn S McKinley, Microsoft Research; Dan Grossman, University of Washington; Luis Ceze, University of Washington.
  3. Resugaring: Lifting Evaluation Sequences through Syntactic Sugar
    Justin Pombrio, Brown University; Shriram Krishnamurthi, Brown University.
  4. Taming the Parallel Effect Zoo: Extensible Deterministic Parallelism with LVish
    Lindsey Kuper, Indiana University; Aaron Todd, Indiana University; Sam Tobin-Hochstadt, Indiana University; Ryan R. Newton, Indiana University.
  5. Introspective Analysis: Context-Sensitivity, Across the Board
    Yannis Smaragdakis, University of Athens; George Kastrinis, University of Athens; George Balatsouras, University of Athens.
  6. Dynamic Space Limits for Haskell
    Edward Z. Yang, Stanford University; David Mazières, Stanford University.

Tier 2 : Authors my advisor knows

It’s a reasonable assumption that my advisor probably knows at least one author on each paper, so we can also call this category “Authors whom I might reasonably expect to be introduced to by My Advisor.” These papers include authors whose work I’ve read before and whose names I know from discussions with my advisor. Reading these papers will help prevent the awkward standing-there thing that happens when someone who is much more comfortable than you are (er, than I am) is deep in a conversation and you (I) have nothing to add. It’ll also provide a socially acceptable hook for whatever random thought happens to be passing through your (my) head. Genius, this plan is!

  1. Fast: a Transducer-Based Language for Tree Manipulation
    Loris D’Antoni, University of Pennsylvania; Margus Veanes, Microsoft Research; Benjamin Livshits, Microsoft Research; David Molnar, Microsoft Research.
  2. Automatic Runtime Error Repair and Containment via Recovery Shepherding
    Fan Long, MIT CSAIL; Stelios Sidiroglou-Douskos, MIT CSAIL; Martin Rinard, MIT CSAIL.
  3. Adapton: Composable, Demand-Driven Incremental Computation
    Matthew A. Hammer, University of Maryland, College Park; Yit Phang Khoo, University of Maryland, College Park; Michael Hicks, University of Maryland, College Park; Jeffrey S. Foster, University of Maryland, College Park.
  4. FlashExtract: A Framework for Data Extraction by Examples
    Vu Le, UC Davis; Sumit Gulwani, Microsoft Research Redmond.
  5. Test-Driven Synthesis
    Daniel Perelman, University of Washington; Sumit Gulwani, Microsoft Research Redmond; Dan Grossman, University of Washington; Peter Provost, Microsoft Corporation.
  6. Consolidation of Queries with User Defined Functions
    Marcelo Sousa, University of Oxford; Isil Dillig, Microsoft Research; Dimitrios Vytiniotis, Microsoft Research; Thomas Dillig, UCL; Christos Gkantsidis, Microsoft Research.
  7. Atomicity Refinement for Verified Compilation
    Suresh Jagannathan, Purdue University; Vincent Laporte, INRIA Rennes; Gustavo Petri, Purdue University; David Pichardie, INRIA Rennes; Jan Vitek, Purdue University.

Tier 3 : The Competition

The Student Research Competition, that is. Some of those presenting at SRC are also presenting work at the main event. Since we’ll presumably have some forced socialization, it’s probably a good call to get an idea of what some of their other work is about.

  1. A Theory of Changes for Higher-Order Languages – Incrementalizing Lambda-Calculi by Static Differentiation
    Yufei Cai, Philipps-Universität Marburg; Paolo G. Giarrusso, Philipps-Universität Marburg; Tillmann Rendel, Philipps-Universität Marburg; Klaus Ostermann, Philipps-Universität Marburg.
  2. Commutativity Race Detection
    Dimitar Dimitrov, ETH Zurich; Veselin Raychev, ETH Zurich; Martin Vechev, ETH Zurich; Eric Koskinen, New York University.
  3. Code Completion with Statistical Language Models
    Veselin Raychev, ETH Zurich; Martin Vechev, ETH Zurich; Eran Yahav, Technion.
  4. Verification Modulo Versions: Towards Usable Verification
    Francesco Logozzo, Microsoft Research; Manuel Fahndrich, Microsoft Research; Shuvendu Lahiri, Microsoft Research; Sam Blackshear, University of Colorado at Boulder.
  5. Adaptive, Efficient Parallel Execution of Parallel Programs
    Srinath Sridharan, University of Wisconsin-Madison; Gagan Gupta, University of Wisconsin-Madison; Gurindar Sohi, University of Wisconsin-Madison.
  6. Globally Precise-restartable Execution of Parallel Programs
    Gagan Gupta, University of Wisconsin-Madison; Srinath Sridharan, University of Wisconsin-Madison; Gurindar S. Sohi, University of Wisconsin-Madison.

There are 13 SRC participants in total. Five are also presenting at the conference proper (one is on a paper in another tier).

Tier 4 : Pure Interest

No motivation, except that the papers look interesting.

  1. Improving JavaScript Performance by Deconstructing the Type System
    Wonsun Ahn, University of Illinois at Urbana Champaign; Jiho Choi, University of Illinois at Urbana Champaign; Thomas Shull, University of Illinois at Urbana Champaign; Maria Garzaran, University of Illinois at Urbana Champaign; Josep Torrellas, University of Illinois at Urbana Champaign.
  2. Automating Formal Proofs for Reactive Systems
    Daniel Ricketts, UC San Diego; Valentin Robert, UC San Diego; Dongseok Jang, UC San Diego; Zachary Tatlock, University of Washington; Sorin Lerner, UC San Diego.
  3. Tracelet-Based Code Search in Executables
    Yaniv David, Technion; Eran Yahav, Technion.
  4. Getting F-Bounded Polymorphism into Shape
    Benjamin Lee Greenman, Cornell University; Fabian Muehlboeck, Cornell University; Ross Tate, Cornell University.
  5. Compositional Solution Space Quantification for Probabilistic Software Analysis
    Mateus Borges, Federal University of Pernambuco; Antonio Filieri, University of Stuttgart; Marcelo D’Amorim, Federal University of Pernambuco; Corina S. Pasareanu, Carnegie Mellon Silicon Valley, NASA Ames; Willem Visser, Stellenbosch University.
  6. Test-Driven Repair of Data Races in Structured Parallel Programs
    Rishi Surendran, Rice University; Raghavan Raman, Oracle Labs; Swarat Chaudhuri, Rice University; John Mellor-Crummey, Rice University; Vivek Sarkar, Rice University.
  7. VeriCon: Towards Verifying Controller Programs in Software-Defined Networks
    Thomas Ball, Microsoft Research; Nikolaj Bjorner, Microsoft Research; Aaron Gember, University of Wisconsin-Madison; Shachar Itzhaky, Tel Aviv University; Aleksandr Karbyshev, Technical University of Munich; Mooly Sagiv, Tel Aviv University; Michael Schapira, Hebrew University of Jerusalem; Asaf Valadarsky, Hebrew University of Jerusalem.
  8. AEminium: A permission based concurrent-by-default programming language approach
    Sven Stork, Carnegie Mellon University; Karl Naden, Carnegie Mellon University; Joshua Sunshine, Carnegie Mellon University; Manuel Mohr, Karlsruhe Institute of Technology; Alcides Fonseca, University of Coimbra; Paulo Marques, University of Coimbra; Jonathan Aldrich, Carnegie Mellon University
  9. First-class Runtime Generation of High-performance Types using Exotypes
    Zachary DeVito, Stanford University; Daniel Ritchie, Stanford University; Matt Fisher, Stanford University; Alex Aiken, Stanford University; Pat Hanrahan, Stanford University.

Why is pure interest ranked last?

People say that a talk should be an advertisement for the paper. If I don’t get through the papers in tier 4 before PLDI, I’ll at least know which talks I want to go to and perhaps prune that list accordingly. Since a conference is actually a social event, it seems like a better use of time to target papers that I would expect to come up in conversation. I haven’t tried this tactic before, so we’ll see how things go!

Finally, I’d like to thank the NSF, the ACM, and President Obama for help on my upcoming travel.

Get money, get paid.

When we allow breakoff, we typically tell respondents that they will be paid a bonus at the end of the study, commensurate with the amount and quality of the work they do. We don’t typically tell them when the study will end, since we may need to re-run surveys and we’d like to keep respondents from trying to game our algorithms. The experiments we ran in the fall that awarded bonuses used a tiered system. The ones listed here use respondents’ scores to identify the top 95% of scores and award those respondents two cents per question. I’d like to investigate more sophisticated pricing schemata with Sara in the future.
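
To make the bonus scheme concrete, here is a minimal sketch of one way to compute it; the data shapes and field names are hypothetical, not SurveyMan’s actual code.

```python
# Sketch of the tiered-bonus idea above: respondents whose scores fall in the
# top 95% get $0.02 per question they answered. Field names are hypothetical.
def compute_bonuses(responses, per_question=0.02, cutoff_quantile=0.05):
    """responses: list of dicts with 'worker_id', 'score', and 'num_answered'."""
    scores = sorted(r['score'] for r in responses)
    threshold = scores[int(cutoff_quantile * len(scores))]  # approximate 5th-percentile score
    return {r['worker_id']: per_question * r['num_answered']
            for r in responses if r['score'] >= threshold}
```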

Wage Survey

It was my intention to have only one wage survey running on AMT. However, as I’ve been porting the Python analyses into Clojure, I found that I actually had three instances running. Given the expiration dates on the latter two, I’m pretty sure they were posted accidentally. I should probably consider asking users if they’re sure they want to continue, or adding a safe mode that asks “are you sure you want to do X?” a million times, so future users don’t make this mistake. There’s also a small possibility that when I extended the original HIT, it somehow spawned two new HITs instead. This isn’t documented anywhere, but it’s something I probably want to double-check on the sandbox.

So we had three surveys running. At the time of our OOPSLA submission, the wage survey had been running for about four days and had accrued only 69 responses. I extended that HIT twice. It expired Mar 26 2014, 04:42 PM PDT; the other two HITs had expiration dates of Mar 28 2014, 04:44 PM PDT and Mar 30 2014, 07:22 PM PDT. Each HIT requested 150 assignments and paid a $0.10 base wage per survey. Between the three surveys, we collected 154 responses. Under normal circumstances, I wouldn’t have three copies of the same HIT running concurrently. A feature I might consider adding to SurveyMan is a check for whether a survey with similar parameters has been posted before; if I implement a “safe mode” version of SurveyMan, I could then ask the user whether they really want to post the survey.

Anyway, the point is that because I had three versions running, I had repeaters: we had only 132 unique respondents. I typically exclude repeaters from the analyses, since we tell respondents to return the survey if they’ve already taken it. After running our new dynamic analysis report, I found that 98 respondents were classified as bad actors. I had a similarly high percentage in the Python analysis and wasn’t confident that it was correct. Since we hadn’t tested the effects of breakoff on bot classification in simulation, I was hesitant to make any strong assertions about these classifications without investigating further. Furthermore, since we had so few data points at the time of the OOPSLA submission, we decided to simply report qualitative results.

Examining the larger data set, we found that the maximum number of questions answered by any one respondent was still 26. I’ll leave a more thorough analysis of our quality control to another blog post, but the results were quite interesting and corroborated some of my suspicions about the original set of data. In any case, the number of questions bad actors answered ranged from a high of 18 to a low of 2. (Interestingly, those who answered only one question were all repeaters. Inspecting manually, I saw 10 repeaters in total, of whom only one appeared to be a legitimate bad actor.)

Using the federal minimum wage of $7.25 an hour and an estimated 10 seconds per question, we should award $0.02 per question. Since we had already awarded a base pay of $0.10, we subtracted that amount from the total payment calculated for each honest respondent. The static analyses gave us an average path length of 41 questions and an expected payment of $0.825694 for each respondent who answered the survey to completion. Since we requested 150 responses, if every respondent were categorized as honest and answered the survey to completion, it would cost us $123.8541, not counting the AMT commission. AMT charges a 10% commission on both the base pay and the bonuses, which gives us a total expected cost of $136.23951.
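
As a sanity check on the arithmetic, here is a small sketch (constants taken from this post) that reproduces the expected-cost figures:

```python
# Expected cost of the wage survey if every respondent were honest and
# answered to completion. Constants come from the discussion above.
MIN_WAGE = 7.25           # dollars per hour
SECS_PER_QUESTION = 10
AVG_PATH_LENGTH = 41      # questions, from the static analysis
TARGET_RESPONSES = 150
AMT_COMMISSION = 0.10     # charged on base pay and bonuses

per_question = MIN_WAGE / 3600 * SECS_PER_QUESTION       # ~$0.0201
per_survey = AVG_PATH_LENGTH * per_question               # ~$0.8257
before_commission = TARGET_RESPONSES * per_survey         # ~$123.85
total = before_commission * (1 + AMT_COMMISSION)          # ~$136.24
print(round(total, 2))
```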

Our actual costs were much lower (although the quality of the data was presumably also lower). Our 154 respondents cost us $22.53 in base pay, including the AMT commission. We calculated $7.78 in bonuses to be awarded, which comes to $8.558 with the commission. In total, the experiment under these conditions cost us $31.088.

Prototypicality

The next survey we’ll look at is Presley’s prototypicality survey. We had 149 responses in total. The survey had an average path length of 17 questions. The estimated base price for an honest respondent who answers the survey in full is $0.342361 per survey, and our expected cost for all honest respondents, plus the AMT commission, is $56.489565. We classified 65 respondents as bad actors and 84 as valid responses. These results differ from our initially reported results because of how we calculate the frequencies for questions that are variants. In our Python code, we only compared questions that were exactly the same — that is, we didn’t unify the distributions of the variants. In the Clojure implementation, we first remove bots, assume that there is no statistical difference between the variants, and unify them all. We’ll leave a discussion of the pros and cons of this approach to a future blog post.
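
A rough sketch of the unification step, under the assumption that we simply pool the answers to all variants of a question after removing bots; the data shapes here are mine, not the Clojure implementation’s.

```python
from collections import Counter

def unify_variants(variant_answers):
    """variant_answers: {variant question id: [answers from non-bot respondents]}.
    Pools all variants into a single answer-frequency distribution."""
    pooled = Counter()
    for answers in variant_answers.values():
        pooled.update(answers)
    total = sum(pooled.values())
    return {option: count / total for option, count in pooled.items()}
```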

The result of unifying variants is that significantly more responses are classified as bots, but none of the variants are flagged as being drawn from different distributions! I’ll have to double-check that the variant code is running as expected, but since my unit test for flagging variants seems to work, I’m going to assume that the differences we detected were due to outliers. We can discuss what might be going on here in a later blog post, since this one is mostly about pricing. The bonuses to be paid amount to $20.32, or $22.35 with the AMT commission. In total, this survey cost us $38.742.

Choose Randomly

I ran a survey that wasn’t featured in our OOPSLA paper because it was one I made up entirely. The idea was to post a survey with two floating blocks, both of which asked respondents to choose one of the responses randomly; one block had identical options, whereas the other had arbitrary categories of things. I wanted to see how random people could actually be. I also wanted to track timing information. One thing I noticed about this survey, which I posted on a Wednesday morning, was that I was able to collect all of my responses within 90 minutes. The time per question here is clearly less than 10s; if it takes about 2 seconds per question, then the expected cost for a completed survey would be about $0.08. As a result, I decided not to award bonuses. The total cost was $16.50.

Phonology

Finally, we have our classic phonology survey. With an average path length of 99 questions, we would expect to pay $1.993750 per survey. With a target of 150 responses, our total cost, including commission, would be $328.96875.

We ran this survey three times (not counting our preliminary run); the details of each run can be found in an earlier post. We collected 395 responses in total from 311 unique respondents, with 22 respondents accounting for the 84 duplicate responses. Base pay for these responses, including the AMT commission, cost us $43.45. 182 responses were classified as valid. The total bonuses to be paid were calculated to be $327.96, or $360.756 with the AMT commission. In all, this survey will have cost us $404.206.

Notes on timing

Some of the timing data returned to us was flawed, so we couldn’t use it to improve our payment scheme. One possible use of this information would be as a proxy for question difficulty, with payment varied accordingly.

On calculating survey entropy

I’ve been spending the past two weeks converting analyses that were implemented in Python and Julia into Clojure. The OOPSLA Artifact Evaluation deadline is June 1 and moving these into Clojure means that the whole shebang runs on the JVM (and just one jar!).

One of the changes I really wanted to make to the artifact we submit was a lower upper bound on survey entropy. Upper bounds on entropy can be useful in a variety of ways: in these initial runs we did for the paper, I found them useful for comparing across different surveys. The intuition is that surveys with similar max entropies have similar complexity, similar runtimes, similar costs, and similar tolerance to bad behavior. Furthermore, if the end-user were to use the simulator in a design/debug/test loop, they could use max entropy to guide their survey design.

We’ve iterated on our calculation of the max entropy; each improvement has lowered the upper bound for some class of surveys.

Max option cardinality. Our first method for calculating the maximum entropy of a survey was the one featured in the paper: we find the question with the largest number of options and say that the entropy of the survey must be less than the entropy of a survey with the same number of questions in which every question has this maximum number of answer options, each option being equally likely to be chosen. For some $$survey$$ having $$n$$ questions, the maximum entropy would then be $$\lceil n \log_2 (\max ( \lbrace \lvert \lbrace o : o \in options(q) \rbrace \rvert : q \in questions(survey) \rbrace ) ) \rceil$$.
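
As a sketch (in Python rather than the Clojure implementation), this calculation amounts to:

```python
import math

def max_option_cardinality_entropy(option_counts):
    """option_counts: the number of answer options for each question in the survey.
    Treats every question as if it had as many options as the largest question."""
    return math.ceil(len(option_counts) * math.log2(max(option_counts)))
```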

The above gives a fairly tight bound on surveys such as the phonology survey. For surveys that have more variance in the number of options proffered to the respondent, it would be better to have a tighter bound.

Total survey question max entropy. We’ve had a calculation for total survey question max entropy implemented in Clojure for a few weeks now. For any question having at least one answer option, we calculate the entropy of that question and sum up all those bits. For some $$survey$$ having $$n$$ questions, where each question $$q_i$$ has $$m_i$$ options, the maximum entropy would then be $$\lceil \sum_{i=1}^n \mathbf{1}_{\mathbb{N}^+}(m_i)\log_2(m_i)\rceil$$.
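
The corresponding sketch for the per-question sum, again hypothetical Python rather than the Clojure code:

```python
import math

def total_question_max_entropy(option_counts):
    """option_counts: the number of answer options for each question in the survey.
    Questions with no answer options contribute zero bits."""
    return math.ceil(sum(math.log2(m) for m in option_counts if m > 0))
```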

While the total survey question max entropy gives a tighter bound for surveys with higher variance, it is still a bit too high for surveys with branching. Consider the wage survey. In Sara’s initial formulation of the survey (i.e., not the one we ran), the question with the greatest number of answer options was one asking for the respondent’s date of birth; the answer options were years ranging from 1900 to 1996 (97 options in all). Most of the remaining questions have about 4 options each:

#/Options   #/Questions
2           8
3           5
4           16
5           2
6           2
7           2
8           1
9           1
10          1
97          1

Clearly in this case, using max option cardinality would not give much information about the entropy of the survey. The max cardinality maximum entropy calculation gives 258 bits, whereas the total survey question max entropy gives 80 bits.
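
Plugging the option counts from the table into the two definitions confirms these numbers: $$\lceil 39 \log_2 97 \rceil = \lceil 257.4 \rceil = 258$$ for max option cardinality, and $$\lceil 8\log_2 2 + 5\log_2 3 + 16\log_2 4 + 2\log_2 5 + 2\log_2 6 + 2\log_2 7 + \log_2 8 + \log_2 9 + \log_2 10 + \log_2 97 \rceil = \lceil 79.4 \rceil = 80$$ for the per-question sum.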

This lower upper bound still has shortcomings, though — it doesn’t account for branching. For many surveys, branching is used to ask one additional question, to help refine answers; in these surveys, many respondents answer every question. However, some surveys are designed so that no respondent answers every question: branching is used to re-route respondents along a particular path. We used branching in this way when we actually deployed Sara’s wage survey. The translated version of Sara’s survey has two 39-question paths, with a 2-option branch question to start the survey and a zero-option instructional question to end it. This version of the survey has a max cardinality maximum entropy of $$\lceil 80 \log_2 97 \rceil = 528$$ bits and a total survey question max entropy of 160 bits (without the ceiling operator, the latter is approximately twice the entropy of the previous version, plus one bit for the introductory branch question).

The maximum number of bits needed to represent this survey approximately doubled from one version to the next. This isn’t quite right — we know that the maximum path through the survey is 41 questions, not 80. In this case, branching makes a significant difference in the upper bound.

Max path maximum entropy. Let’s instead compute the maximum entropy over distinct paths through the survey. We’ve previously discussed the computational complexity of computing distinct paths through surveys. In short, randomization significantly increases the number of possible paths through the survey; if we focus on paths through blocks instead, the problem becomes more tractable. Rather than thinking about paths through the survey as distinct lists of questions, where equivalent paths have equivalent lengths and orderings, we can instead think about them as unique sets of questions. This perspective aligns nicely with the invariants we preserve.

Our new maximum entropy calculation will compute the entropy over unique sets of questions and select the maximum entropy computed over this set. Some questions to consider are:

  1. Are joined paths the same path?
  2. If we are computing empirical entropy, should we also consider breakoff? That is, do we need the probability of answering a particular question?

We consider paths that join to be distinct from each other; if we don’t consider breakoff, the probabilities of answering the question at which they join sum to one across those paths. As for breakoff, for now let’s ignore it. If we need to compute the empirical entropy over the survey (as opposed to the maximum entropy), then we will use the subset relation to determine which responses belong to which paths. That is, if we have a survey with paths $$q_1 \rightarrow q_2 \rightarrow q_4$$ and $$q_1 \rightarrow q_3 \rightarrow q_4$$, then a survey response with only $$q_1$$ answered will be used to compute the path frequencies and answer option frequencies for both paths. The maximum entropy is then computed as $$\lceil \max(\lbrace -\sum_{q\in survey} \sum_{o \in ans(q)} \mathbb{P}(o \cap p) \log_2 \mathbb{P}(o \cap p) : p \in paths \rbrace) \rceil$$.

There are two pieces of information we need before actually computing the maximum entropy path. First, we need the set of paths. Since paths are unique over blocks, we can define a function that returns, for each path, its set of blocks. The key insight here is that for blocks that have the NONE or ONE branch paradigm, every question in that block is answered. For the branch ALL paradigm, every question is supposed to be “the same,” so they will all have the same number of answer options. Furthermore, since the ordering of floating (randomizable) top-level blocks doesn’t matter, and since we prohibit branching from or to these blocks, we can compute the DAG on the totally ordered blocks and then just concatenate the floating blocks onto the unique paths through those ordered blocks.
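
A minimal sketch of that path enumeration, assuming a DAG over block identifiers; these structures are hypothetical, not SurveyMan’s internal representation.

```python
def block_paths(block_dag, start_block, floating_blocks):
    """block_dag: {block id: [successor block ids]} over the totally ordered blocks,
    with terminal blocks mapping to []. Floating blocks are appended to every path,
    since their position doesn't change which questions get answered."""
    paths = set()

    def walk(block, prefix):
        prefix = prefix + [block]
        successors = block_dag.get(block, [])
        if not successors:
            # A complete path: the ordered blocks seen so far plus all floating blocks.
            paths.add(frozenset(prefix) | frozenset(floating_blocks))
        else:
            for nxt in successors:
                walk(nxt, prefix)

    walk(start_block, [])
    return paths  # unique sets of blocks, standing in for unique sets of questions
```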

The second thing we need to compute is $$\mathbb{P}(o \cap p)$$. The easiest way to do this is to take a survey response and determine which unique path(s) it belongs to. If we count the number of times we see option $$o$$ on path $$p$$ and divide by the number of responses on $$p$$, the probability we’re estimating is $$\mathbb{P}(o | p)$$. We can compute $$\mathbb{P}(o \cap p)$$ from $$\mathbb{P}(o | p)$$ by noting that $$\mathbb{P}(o \cap p) = \mathbb{P}(o | p)\mathbb{P}(p)$$. This quantity is computed by $$\frac{\# \text{ of } o \text{ on path } p}{\#\text{ of responses on path } p}\times\frac{\#\text{ of responses on path } p}{\text{total responses}}$$, which we can reduce to $$\frac{\# \text{ of } o \text{ on path } p}{\text{total responses}}$$. It should be clear from this derivation that even if two paths join, the entropy for the joined subpath is equal to the case where we treat the paths separately.
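
Putting the two pieces together, here is a sketch of the empirical calculation; the response format and helper names are assumptions, not SurveyMan’s API.

```python
import math
from collections import Counter

def max_path_entropy(responses, paths):
    """responses: list of {question id: chosen option id} dicts.
    paths: collection of frozensets of question ids (one per unique path).
    A response counts toward every path that contains all of its answered questions."""
    total = len(responses)
    best = 0.0
    for path in paths:
        counts = Counter()
        for resp in responses:
            if set(resp) <= path:               # response lies on this path
                for q, o in resp.items():
                    counts[(q, o)] += 1
        # P(o ∩ p) is estimated as (# of o on path p) / (total responses).
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        best = max(best, entropy)
    return math.ceil(best)
```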

The maximum entropy for the max path in the wage survey, computed using the current implementation of SurveyMan’s static analyses, is 81 bits — equivalent to the original version of the survey, plus one extra bit for the branching.