Notes from Charles Sutton’s Talk

About a month ago, Charles Sutton stopped by UMass to give a talk called “Statistical Analysis Of Computer Programs.” Here are my lightly-edited, cleaned-up (ish) notes on the talk (this approach inspired by ezyang‘s amazing note-taking abilities):

conceit — text as language
(bad metaphor)
computer program = precise set of instructions
people — more aspects (social interactions)
language a good metaphor as a mean of human communication

when writing code, others may want to use later
next person might be you!
— lots of open source code online
— lots of implicit knowledge in how to write software — libraries api, avoid bugs
— easy to read, easy to maintain
take implicit knowledge and make explicit
writing code in a new library, look at what people did in the past
suggest patterns
key insight: means of communication
regularities of natural language (NL) may be found in programming languages (PL)
no new machine learning
apply existing techniques to statistical NLP, new patterns
coding conventions
— names, formatting
— get a compressed representation of long verbose source files
mining idioms
— small syntactics patterns in code
describe how we use them

NL coding conventions
local->global movement of abstraction
what kinds of coding conventions?
junit example
create input stream in java
create an identifier
name of input stream
maybe know the type of name someone would use who contributed to the junit project
formatting–e.g. braces
coding convention is a syntactic constraint beyond that imposed by the languages grammar
programmers themselves decide to impose on top of what the compiler requires them to do
developers care a lot
style checkers style guide

“small amount” of research in software engineering
how they use these in software engineering…

go through conventions, find out what’s important — big commercial (microsoft) projects

threads different aspects of code committed

38 percent related to conventions rather than functionality
(why i hate code review!)

why not just run a formatter over the code?
corner cases — it doesn’t handle for you
renaming variables to be more consistent → (jsnice)
review time — can talk more about the functionality
where do conventions come from?
– implicit from code base
one programmer starts, others pick it up
emergent quality
mores, rather than laws
large number of software constraints, well modeled with statistical machine learning
even with a lot of programming experiences, won’t know about how things are named
coding convention inference problem –> why not use machine translation to take my conventions and change them to your conventions?

eclipse plugin called devstyle

click on your identifier, will give you a list of other names
renaming suggestions
how should class objects be named
some disagreement with conventions
we can suggest a name
go through a region and rank names
block of code someone wants to add to the project

how large of a corpus would you need?

what kind of technology do you use inside the scoring function?
n-gram language model

smoothing, taking into account the constraints of the compiler/language conventions
constraint is library of changes
only renaming, not generating syntactically correct (would not be syntactically correct)

pull together all uses of the identifier

look at the set of all other names that have been used anywhere in previous or succeeding context
ask ngram language model of joint prob. of entire file — sounds really expensive
— actually using any ngram centered about the thing we want to rename

laura: coreference analysis on the code
— knowing that i and i are the same gives you a nonlinear language mode
— can you get something more robust
— tapping the compiler for name resolution
— how do you corporate into the model (only incorporate into the suggestions — done post hoc)

score by ngram model, threshold so user doesn’t see terrible suggestions
side effect of architecture:
don’t want names to be very common
system does not choose really common names
sympathetic uniqueness principle
have variable program entity, give unusual name
have a very domain specific thing, such as smoothing, choose an appropriate name, considered appropriate big statistical properties

in training set, if we have an id that occurs infrequently, give ? for unknown (deals with tails of the distribution)
suggestion process for alternatives, use known token
whatever context, use rare word, don’t suggest change

what happens if you have a common word, but there should be an unusual — don’t know if there is an answer to this common/question

like adding a new table in a conditional random field (CRP)

formatting conventions
encoding spacing decisions as tokens (indexed by location, foo)

use same framework or suggestions on this basis

does this thing work?
evaluation methodology
automatic evaluation:
— doesn’t say what this is (is it the generative thing, idempotency — thats how you would explain)
— should really being doing a case study no user study or human evaluation or whatever

don’t want low precision win this tool because people won’t use it
95% accuracy — basically, what google does
can do this for other types of identifiers, but its harder
sympathetic uniqueness — do we rename everything as i?
x axis is similar, how often do we make revisions
y axis — element of surprise — things that were rare, what percentage did we incorrectly try to say were something else
set threshold high, no suggestions, no new names
set threshold low, rename everything

methods, variables, and types are very different
variables and types back off
methods are much more surprising all of the time

naturalize tools

final thing — test on github, submit patches
look at top suggestions on top 5 suggestions
submitted 18 patches
14 accepted
do programmers really care about this kind of name
suggest that their exceptions be renamed from e to t?
t was okay
throwable (t) to e
people accept t
evidence that programmers think about and care about
paper on arxiv — fowkes, ranca, allamanis, lapata, sutton

question: exists bad users?
– AI complete question

question: hard metrics for successful renaming — most people like it better?
– programmers are picky
– does it actually make programs easier to maintain

question: fancier language model better?
– yes, think so
– types are names recovering — very conventional (very GP)
– java — names are 1:1 correspondence to class

question: run on dynamic languages?
— no results
— corpus of java, c, python
don’t know if its run on python — i don’t think it would work as well on python because it is not as redundant as java

new topic:
autofolding to summarize code

summarize, compress out java boilerplate
use code folding (which obviously uses the fact that we know blocks are denoted by braces)

is the summary just folding?

difference audiences
task based vs non task based
experience versus novice
expert in project
first look problem — opening single file first time, get overview

TASSAL — tree based autofolding software summarization algorithm
start with file, parse, say we want to compress certain types — block statements, comments

foldable tree — subset of AST that contains nodes we could consider following

file → bag of words — split identifiers by camel case
some of these are going to be generic java stuff
some are concepts used throughout the project

topic models — find characterizing words for files and packages; tried method as leaves, but this was too sparse

for each node, pick a mixing distribution

single topic per file — packages or other levels of abstraction <– wonder if you could use this for refactoring

— fit the model
this will give us for each token in the source file an indicator variable whether this thing is generated from java, package, file (how characteristic)

think of optimization problem
vector u binary vector
each element indicates whether node is folded
— okay, so topic models for summarization, dug
look at all tokens assigned to file via generative process
empirical distribution of nodes included in the summary

constraints within budget for length of the summary

tree consistency
— if a node is included in the summary, must include parents

optimise via greedy algorithm

question: what about naming or other conventions that are drawn from different natural language distributions?
— clusters of developers that following different conventions — cluster developers together with both formatting naming — models don't work as well

question: what if i have multiple devs collaborating ON THE SAME FILE?
— topic per continent
— run topic per content
— where does z come from?

taking topic models and applying to name? not done yet

look at example topics
example columns are topics
three background topics (e.g. get string baclue name type object i)
projects: spring, bigbluebutton
files: datasourceutils, qualsp

to evaluate:
create gold standard
folded files manually to measure precision and recall
compare with
javadocs — always include, add random
shallowest to deep
expand nodes in order of length
heuristic, but that’s all eclipse is doing — comparing with state of the art

show summaries to developers
6 developers, avg. 4 years industrial experience
— rate conciseness and usefulness
these are more concise and useful
automatic summaries from TASSAL were better than any of the other baselines

third thing:
mining idioms from code using existing NLP tools

what are code idioms?

example: reading into a buffer, iterating over an array
— are these all things that are encapsulated by other abstractions?
opening resources/context — common pattern
need meta variables

code idiom is a syntactic code fragment that recurs frequently across software projects and has a single semantic purposes

— wondering if you could learn to match ASTs from one language to another (from one that has these higher level abstractions to one that doesn't)

idiom-related tools — intellij and eclipse

no way of identifying which idioms are useful (presumably to add them to the IDEs — how do you find new ones)

other types of code patterns
— surface level — code clones copy past code garments
api patterns usage patterns of methods

idiom mining problem —
can i find these templates?

use a probabilistic grammar
CFG slides, pCFG slide

use tree substitution grammar — generalization of a tree joining grammar
non-terminal can expand into a tree instead of a list of terminals and nonterminals

can make a probabilistic version

tree substitution grammar over tree nodes and regular expansions
represents a family of idioms

this will allows us to represent these idioms

input: probabilistic tree substitution grammar within
take a corpus of ASTs
learn the grammar
every tree rule in the TSG that i learn i treat as an idiom

convert into a textual representation

build a library of idioms to show developers
how do we infer the grammar?

maximum likelihood conditioned on the pTSG rules

previous work from sharon goldwater et al
— infer what these trees are, pick list of trees that best explain the corpus
— number of possible things i could put in theta are intractable, maximum likelihood is degenerate, pick 1:1 rules to trees

don't make tree fragments too big
put a prior on probabilistic grammars
if you're going to add another tree, this is what the idiom would like
get a joint distribution over pCFGs and source files
dist. over dist. of parse trees

given a corpus code of code to get a distribution over probabilistic grammars over trees i've inferred
type-based MCMC from liang et al; (think this is from liang’s GP-like work)

some questions i didn't catch

mined idioms
iterator, loop through lines, logger for class, string constant
get patterns you would actually find
get something from actual APIs
e.g. database transaction (opening a resource and cleaning up properly)
get the distance between two points in ??
jsoup get mhtl

lots of work in SE in API mining
no syntactically nested things

take a held out set of files
percentage of AST nodes explained by the —/
existing method for clone detection
completely duplicated ?

copy paste phenomenon

idioms we find occur across projects much more often…?
SE perspective — dozens of papers in SE about copy past clones

if these things are really idioms, maybe they will occur more often in example code — actually what happens
from a data set of regular projects on github, we find 22% of idioms found are actually used in examples, higher in stack overflow

finally, can do a co-occurrence matrix — how are idioms used across different projects
eclipse snipmatch
— open source addition to eclipse — manually took 44 snippets, stuff worked or something
cant put all the idioms in the tool, so many found
considered this as validation that the thing works
interesting that there i was one idiom used that is considered bad practice

exploiting that source code is a means of human communication

maybe surprising to people who are from a different background, that you would need to train the model

Coq Blocked

Some time ago Cibele and I were working on a fun logic project that involved encoding classical logic in Coq. Perhaps it seems odd that we would write a logic language in language that has a logic already encoded in it, but this is a classic application of PL goodness (i.e., nonsense): we used Gallina as our meta language to encode the primitives of our target language (propositional logic, for starters). You can see the fruits of our efforts here.

Our initial goal was to be able to prove the first homework assignment using just Coq. We needed to define concepts such as suitability, satisfiability, validity, etc. We ran into problems during our first pass because we tried to define everything as a function, which doesn’t lend itself to proofs.

In order to proof even the most simple theorems, we needed to define the concept of suitability. We started by defining atomic formulae and assignments:

Inductive atomic :=
| A : nat -> atomic.

Definition assignment : Set := list (atomic * bool).

Many of things we wanted to prove relied on an assignment being suitable for a formula. However, we did not want to have to traverse the assignment or prove suitability or a number of trivial cases each time we needed suitability. Our initial pass used option types extensively, but this made many of the proofs cumbersome and unruly. It also lead us astray on more complicated proofs. So, we defined some concepts with dependent types to clean things up a bit:

Fixpoint in_assignment n (a : alist) : Prop :=
match a with
| nil => False
| (h,_)::t => if beq_nat n h
then True
else in_assignment n t

Lemma in_empty : forall a, in_assignment a nil -> False.
intros; compute in H; apply H.

Fixpoint find_assignment n (a : alist) : in_assignment n a -> bool :=
match a with
| nil => fun pf => match (in_empty n) pf with end
| (h, tv)::t => if beq_nat h n
then fun _ => tv
else find_assignment n t

The above defines a function, find_assignment that gets around using option types. If we could get the above to work, then we could use in_assignment to define suitability and replace in_assignment n a with suitable n a. You can see a discussion of the problem at this gist, where we simplified assignments to be an associative list, so we wouldn’t have to worry about atomic types.

The main problem, as discussed on the gist, is that we needed to apply a proof in the inductive case of find_assignment. We couldn’t figure out how to do this because we were focused on finding errors in find_assignment. The problem with our code was actually in in_assignment: we were using beq_nat to evaluate down to a boolean, rather than using a proposition that employed sumbool (which would give us a bunch of decidibility theorems for free).

Stats Review: Infinite sets

I’m looking over Casella and Berger’s Statistical Inference and reviewing some of the concepts. For the record, this book is exactly what you want if you need to take a statistics qualifying exam as a graduate student and not at all what a generalist will want. I actually enjoyed the material for the relatively clean overview it gave. There’s depth, but not so much as to deter someone without a degree in mathematics. That said, I would not recommend this book for beginners. If you do want to slog through, get a supplementary book.

I’m currently looking at a section on page 4 of the second edition. There is a section that begins:

The operations of union and intersection can be extended to infinite collections of sets as well If $$A_1, A_2, A_3, …$$ is a collection of sets, all defined on a sample space $$S$$, then

$$\quad\bigcup_{i=1}^\infty A_i = \lbrace x \in S : x \in A_i \text{ for some } i\rbrace$$
$$\quad\bigcap_{i=1}^\infty A_i = \lbrace x \in S : x \in A_i \text{ for all } i \rbrace$$

For example, let $$S = (0, 1]$$ and define $$A_i = [(1/i), 1]$$. Then

$$\bigcup_{i=1}^\infty A_i = \bigcap_{i=1}^\infty [(1/i), 1] = \lbrace x \in (0,1] : x \in[(1/i), 1] \text{ for some } i \rbrace$$
$$\quad\quad\quad = \lbrace x \in (0, 1]\rbrace = (0,1];$$

$$\bigcap_{i=1}^\infty A_i = \bigcap_{i=1}^\infty[(1/i), 1] = \lbrace x \in (0, 1] : x \in [(1/i), 1] \text{ for all } i \rbrace$$
$$\quad\quad\quad = \lbrace x \in (0, 1] : x \in [1, 1]\rbrace = \lbrace 1 \rbrace$$ (the point 1)

The above occurred after a discussion about countable and uncountable sets and proving theorems about sets from first principles (rather than Venn diagrams). If your eyes kind of glazed over while reading the above, no worries — mine did, too. Actually, my reaction was worse than glazing over: I skimmed and thought I understood. However, as I started to tease apart what was written, I realized that there was much more going on here than I realized.

First of all, the preceding section describes reasoning about events that can be represented as finite sets, over finite sample spaces. This section builds on what we know to discuss the infinite case.

In our example, the sample space is the interval $$(0, 1]$$, which is defined over the reals. We are defining a countably infinite set of intervals, each denoted by some $$A_i$$. How do we know it’s countably infinite? This is implied by the notation: we start at 1 and go to infinity. Therefore, there is a 1-1 correspondence with the natural numbers and thus the set of intervals is infinite. Let’s take a look at some of the intervals (note that I can’t draw lines, so imagine that the dashed line is actually connecting the endpoints and they’re actually aligned):

$$A_1 : \underset{0}{\circ} \quad \quad \quad \quad \quad \quad \underset{1}{\bullet}$$
$$A_2 : \underset{0}{\circ} \quad \quad \quad \underset{\frac{1}{2}}{\bullet} \mbox{——-}\underset{1}{\bullet}$$
$$A_3 : \underset{0}{\circ} \quad \quad \underset{\frac{1}{3}}{\bullet} \mbox{———–}\underset{1}{\bullet}$$
$$A_\infty : \underset{0}{\bullet}\mbox{——————–}\underset{1}{\bullet}$$

Now, the first statement says that the union of infinite subsets of a sample space should be equal to the sample space: $$\lbrace x \in (0,1] : x \in[(1/i), 1] \text{ for some } i \rbrace$$. The statement on the far left of the set notation, $$x\in (0,1]$$, gives us $$x$$’s domain: it is defined over $$S$$, which we have defined to be $$(0,1]$$. The right side of the “such that” states that every x in the domain $$S$$ that is in some partition $$A_i$$ is in this set. Recall that $$A_i$$ was defined as $$[(1/i), 1]$$.

The above statement seems self-evident for rational numbers: for any rational number $$m/n, m > 0 \wedge n > 0$$, we know that there is a coarse-grained bound such that $$1/n$$ is less than or equal to $$m/n$$ and therefore $$m/n$$ is in $$A_n$$. But what about irrational numbers? For this to work, we would need a theorem that gives rational bounds on irrational numbers. This seems like something that ought to be out there, but I’m not sure where to look. I suspect hard-core PL theory, such as work on PCF, would have something to say about this. Number theory and/or real analysis would also be good candidates.

In any case, we are trying to make the argument that for every number on the real interval $$(0,1]$$, there exists at least one sub-interval with rational endpoints, and that the union of these sub-intervals gives us back the interval $$(0, 1]$$ exactly. We don’t miss any numbers on the interval $$(0, 1]$$. We might make some argument about the compactness of the interval.

The second statement defines intersection in this context. Here we would make an argument about the uniqueness of the intervals. That is, if $$i\not = j$$, then $$A_i \not = A_j$$. Every sub-interval is unique by virtue of its lower bound. However, the upper bound (1) is included in every interval. Therefore, the only interval all sub-intervals could share is the point 1.

It seems the point of the example was to show that countably infinite sets do occur and that the principles of set theory still hold. Defining the sample space as the reals appeals to practical considerations not fully explored in the text: we often model phenomena we measure as having infinite precision, but we know that our instruments can only be finitely precise. A countably infinite partition seems like it would introduce less error into a calculation than a finite one, i.e., a histogram.