Category Archives: Formal Methods

Coq Blocked

Some time ago Cibele and I were working on a fun logic project that involved encoding classical logic in Coq. Perhaps it seems odd that we would write a logic language in language that has a logic already encoded in it, but this is a classic application of PL goodness (i.e., nonsense): we used Gallina as our meta language to encode the primitives of our target language (propositional logic, for starters). You can see the fruits of our efforts here.

Our initial goal was to be able to prove the first homework assignment using just Coq. We needed to define concepts such as suitability, satisfiability, validity, etc. We ran into problems during our first pass because we tried to define everything as a function, which doesn't lend itself to proofs.

In order to proof even the most simple theorems, we needed to define the concept of suitability. We started by defining atomic formulae and assignments:

Inductive atomic :=
| A : nat -> atomic.

Definition assignment : Set := list (atomic * bool).

Many of things we wanted to prove relied on an assignment being suitable for a formula. However, we did not want to have to traverse the assignment or prove suitability or a number of trivial cases each time we needed suitability. Our initial pass used option types extensively, but this made many of the proofs cumbersome and unruly. It also lead us astray on more complicated proofs. So, we defined some concepts with dependent types to clean things up a bit:

Fixpoint in_assignment n (a : alist) : Prop :=
match a with
| nil => False
| (h,_)::t => if beq_nat n h
then True
else in_assignment n t

Lemma in_empty : forall a, in_assignment a nil -> False.
intros; compute in H; apply H.

Fixpoint find_assignment n (a : alist) : in_assignment n a -> bool :=
match a with
| nil => fun pf => match (in_empty n) pf with end
| (h, tv)::t => if beq_nat h n
then fun _ => tv
else find_assignment n t

The above defines a function, find_assignment that gets around using option types. If we could get the above to work, then we could use in_assignment to define suitability and replace in_assignment n a with suitable n a. You can see a discussion of the problem at this gist, where we simplified assignments to be an associative list, so we wouldn't have to worry about atomic types.

The main problem, as discussed on the gist, is that we needed to apply a proof in the inductive case of find_assignment. We couldn't figure out how to do this because we were focused on finding errors in find_assignment. The problem with our code was actually in in_assignment: we were using beq_nat to evaluate down to a boolean, rather than using a proposition that employed sumbool (which would give us a bunch of decidibility theorems for free).

Logic and Probability

When I first started at UMass, I had effectively no background in statistics or probability. So, when I was taking the first course in the graduate stats sequence, I tried to frame what I was learning in terms of things I already understood. When I saw the conditional probability \mathbb{P}(Y\;\vert\; X), I couldn't help but think:

\begin{array}{|l} X \\ \hline \vdots \\ Y \\ \hline \end{array}\\ X \rightarrow Y

Assumption seems to be a close analogy of observation, and if we analyze each construct operationally, they both have a strict order (i.e., observe/assume X, then derive/calcuate the probability of Y). Both hold X fixed in some way for part of the calculation. Suppose we then say that X implies Y with some probability p. If we denote this as X \overset{p}{\rightarrow} Y, then we have some equivalence relation where X \overset{p}{\rightarrow} Y \equiv \mathbb{P}(X\rightarrow Y) = p \equiv \mathbb{P}(Y\;\vert\;X) = p.

Since X \overset{p}{\rightarrow} Y is just normal logical implication, with a probability attached, we should be able to use the usual rewrite rules and identities (after all, what's the point of modeling this as a logic if we don't get our usual identities, axioms, and theorems for free?). In classical logic, implication is short for a particular instance of disjunction: X \rightarrow Y \hookrightarrow \neg X \vee Y. We can then rewrite our probabilistic implication as \neg X \overset{p}{\vee} Y and say \mathbb{P}(\neg X \vee Y) = p \equiv \mathbb{P}(\neg X \cup Y) = p.

Similarly, we want to have the usual rules of probability at our disposal, so by the definition of conditional probabilities, \mathbb{P}(Y\;\vert\; X) = \frac{\mathbb{P}(Y\;\cap\;X)}{\mathbb{P}(X)}. We can apply the above rewrite rule for implication to say \mathbb{P}(\neg X \cup Y) = p \equiv \frac{\mathbb{P}(Y\;\cap\;X)}{\mathbb{P}(X)} = p. This statement must be true for all events/propositions X and Y.

Let's take a closer look at a subset of events: those where X is independent of Y, denoted X \bot Y. Independence is defined by the property \mathbb{P}(Y\;\vert\; X)=\mathbb{P}(Y). From this definition, we can also derive the identities \mathbb{P}(X\cap Y) = \mathbb{P}(X)\mathbb{P}(Y) and \mathbb{P}(X\cup Y) = \mathbb{P}(X) + \mathbb{P}(Y). Now we can rewrite \mathbb{P}(\neg X \cup Y) = p \equiv \frac{\mathbb{P}(Y\;\cap\;X)}{\mathbb{P}(X)} = p as \mathbb{P}(\neg X) + \mathbb{P}(Y) = p \equiv \mathbb{P}(Y) = p. Since the relations on either side are equivalent, we can then substitute the right into the left and obtain \mathbb{P}(\neg X) = 0 \equiv \mathbb{P}(Y) = p. Although this looks a little weird, it's still consistent with our rules: we're just saying that when the events are independent (a notion that has no correspondence in our logical framework), the probability of the implication (i.e., the conditional probability) is wholly determined by Y -- if X happens (which it will, almost surely) then Y's marginal is p. If X never happens (which it won't), then Y is 0, and the probability of the whole implication is 0.

Now let's consider how this works over events that are not independent. For this example, let's gin up some numbers:

\mathbb{P}(X) = 0.1 \quad\quad \mathbb{P}(Y) = 0.4 \quad\quad \mathbb{P}(X \cap Y) = 0.09.

Note that X\not\bot\; Y because \mathbb{P}(X\cap Y) \not = 0.04. Recall that because either X or Y are supersets of X\cap Y, their marginals cannot have a lower probability than their intersections.

Now let's compute values for either side of the equivalence \mathbb{P}(\neg X \cup Y) = p \equiv \mathbb{P}(Y\;\vert\; X) = p. First, the conditional probability:

\mathbb{P}(Y\;|\; X) = \frac{\mathbb{P}(Y\cap X)}{\mathbb{P}(X)} = \frac{0.09}{0.1} = 0.9 = p

Now for the left side of the equivalence, recall the definition of union:
\mathbb{P}(\neg X \cup Y) = \mathbb{P}(\neg X) + \mathbb{P}(Y) - \mathbb{P}(\neg X \cap Y).

Since we don't have \mathbb{P}(\neg X \cap Y) on hand, we will need to invoke the law of total probability to compute it: \mathbb{P}(\neg X \cap Y) = \mathbb{P}(Y) - \mathbb{P}(X\cap Y) = 0.4 - 0.09 = 0.31.

We can now substitute values in:
\mathbb{P}(\neg X \cup Y) = 0.9 + 0.4 - 0.31 = 0.99 = p.

Now our equivalence looks like this:
\mathbb{P}(\neg X \cup Y) = 0.99 \equiv \mathbb{P}(Y\;\vert\; X) = 0.9,
which isn't really much of an equivalence at all.

So what went wrong? Clearly things are different when our random variables are independent. Throughout the above reasoning, we assumed there was a correspondence between propositions and sets. This correspondence is flawed. Logical propositions are atomic, but sets are not. The intersection of non-independent sets illustrates this. We could have identified the source of this problem earlier, had we properly defined the support of the random variables. Instead, we proceeded with an ill-defined notion that propositions and sets are equivalent in some way.

Formal Methods in Practice

Emery recently sent me a link to an experience report of a collection of engineers at Amazon who used formal methods to debug the designs of AWS systems. It's pretty light on specifics, but they do report some bugs that model checking found and discuss their use of tools. They describe why they turned to formal methods in the first place (hint: it wasn't a love of abstract algebra and proof by induction) and how they pitched formal methods as their solution to a variety of problems at Amazon.

SAT solvers vs...Something Else

The report begins by noting that AWS is massive. As a result, statistically rare events occur with much higher frequency than the [human] users' gut notion of rare events. The first anecdote describes how one of the authors found some issues in "several" of the distributed systems he either designed or reviewed for Amazon. Although they all had extensive testing to ensure that the executed code would comply with the design, the author realized that these tests were inadequate if the design was wrong. Since he was aware that there were problems with edge cases in the design, he turned to formal methods to ensure that the system was actually exhibiting the behaviors he expected.

His first attempt at proving correctness of these systems was to use the Alloy specification language. Alloy's model checking is powered by a SAT solver. At a high level, SAT solvers model many states of the world (i.e., a program or system) as boolean propositions. If we wanted to talk about a distributed system, we might have propositions like MasterServerLive or PacketSent. These propositions have true or false values. Nothing that isn't explicitly defined can be described.

If you are familiar with propositional and predicated logic, this should all be trivial! However, if you're not, you might be wondering what we mean by "nothing that isn't explicitly defined can be described." Consider this: I have 10 computers from 1995 in my basement. I split them into two clusters. Each has one queen and four drones (1). I'd like to state some property foo about each queen. Since I have two queens, I would have two propositions: FooQueen1 and FooQueen2. If I never need to use either of these propositions separately (that is, if they always appear in expressions looking like this: FooQueen1 \wedge FooQueen2), then I can just define one proposition, FooQueen. If, however, I want to say something like "foo is true for at least one of the queens," the I would need each of FooQueen1 and FooQueen2.

Every time I want to assert some property of these two queens, I will have to create two propositions to express them. Furthermore, if I want to say something we'll call about a relationship bar between a queen and each of its drones, I will have create variables BarQueen1Drone1, BarQueen1Drone2, ..., BarQueen2Drone4. Any relationship or property that isn't described in this way cannot be reasoned about.

Anyone who has experience with first order logic might point out that there is a much more elegant solution to propositional logic -- we could always just describe these properties as predicates over a universe. Instead of defining two propositions FooQueen1 and FooQueen2, we could say \forall x \in \lbrace computers \rbrace: Queen(x) \rightarrow Foo(x), which states that if one of my basement computers is a queen, then foo holds. This expression gives us the possibility of adding another queen to the system without having to explicitly add more propositions.

So far so good, right? Well, there's a another problem with using SAT solvers for distributed systems -- timing information matters. Some processes are redundant, and we need a way to describe concurrency. We could try to define predicates like Before and Concurrently, but again our model explodes in the number of "things" we need to specify. Tools like Alloy have no notion of timing information -- we have to tell it what temporal ordering means.

The author found that Alloy simply was not expressive enough to describe the kinds of concurrency invariants he needed. He instead started using TLA+, a modelling language based on temporal logic, with logical "actions" (I think I learned about this in a class I took on logic in NLP, but I don't recall much about it and the Wikipedia page is empty). This language was expressive enough to describe the systems he wanted to model. Others at Amazon independently came to the same conclusion. Some found a language on top of TLA+, called PlusCal more helpful.


There seems to be a lot of suspicion (if not prejudice) in the "real world" of formal methods. The authors decided to market their approaches as "debugging designs," rather than "proving correctness." This part was the real hook for me. I'll admit, formally proving things can be quite satisfying. Proving correctness or invariance can feel worthy in their own right. Unfortunately, not all problems may be worthy of this effort.

A major problem in the adoption of these tools is that they aren't marketed to the right people in the right way. Pitching formal methods as a way to "debug designs" has a lot of appeal to pragmatists -- it emphasizes the idea that designs themselves may be flawed, and these flaws may be difficult to find. The authors cite medical devices and avionics as fields that value correctness, because the cost of bugs is very high. However, in most software systems, the cost is much lower. A bug that hasn't been exposed in testing is probably quite infrequent. Its infrequency means that the cost of paying someone to ensure a program's correctness outweighs the cost of the consumer's exposure to the bug. What the authors realized, and what they pitched in this paper, was the idea that an event that only occurs 1 out of a billion executions is no longer a rare event in real software systems. There is a mismatch between the scale at which we test and develop and the scale at which we run production code. We may not be able to extrapolate that a program is correct or a system robust to failure on the basis of testing that are orders of magnitude below actual load.

How this relates to me

Our pitch for SurveyMan and related tools is similar to the authors' pitch for using formal methods for large systems. We're starting from the assumption that the design of the survey (or experiment) may be flawed in some way. Rather than deploying costly studies, only to find out months or even years later that there were critical flaws, use formal methods instead. Our tools don't look like formal methods, but they are. We construct a model for the survey on the basis of its structure and a few invariants. We use this data to statically analyze and help debug the results of a survey. The author's model checking and our random respondent simulations are both used to check assumptions about the behavior of the system.


  1. Technically, the "proper" terminology here is master/slave, but that has an inhumane, insensitive ick factor to it, so I've decided to start calling it a queen/drone architecture instead.