# Formal Methods in Practice

Emery recently sent me a link to an experience report of a collection of engineers at Amazon who used formal methods to debug the designs of AWS systems. It’s pretty light on specifics, but they do report some bugs that model checking found and discuss their use of tools. They describe why they turned to formal methods in the first place (hint: it wasn’t a love of abstract algebra and proof by induction) and how they pitched formal methods as their solution to a variety of problems at Amazon.

### SAT solvers vs…Something Else

The report begins by noting that AWS is massive. As a result, statistically rare events occur with much higher frequency than the [human] users’ gut notion of rare events. The first anecdote describes how one of the authors found some issues in “several” of the distributed systems he either designed or reviewed for Amazon. Although they all had extensive testing to ensure that the executed code would comply with the design, the author realized that these tests were inadequate if the design was wrong. Since he was aware that there were problems with edge cases in the design, he turned to formal methods to ensure that the system was actually exhibiting the behaviors he expected.

His first attempt at proving correctness of these systems was to use the Alloy specification language. Alloy’s model checking is powered by a SAT solver. At a high level, SAT solvers model many states of the world (i.e., a program or system) as boolean propositions. If we wanted to talk about a distributed system, we might have propositions like MasterServerLive or PacketSent. These propositions have true or false values. Nothing that isn’t explicitly defined can be described.

If you are familiar with propositional and predicated logic, this should all be trivial! However, if you’re not, you might be wondering what we mean by “nothing that isn’t explicitly defined can be described.” Consider this: I have 10 computers from 1995 in my basement. I split them into two clusters. Each has one queen and four drones (1). I’d like to state some property foo about each queen. Since I have two queens, I would have two propositions: FooQueen1 and FooQueen2. If I never need to use either of these propositions separately (that is, if they always appear in expressions looking like this: $$FooQueen1 \wedge FooQueen2$$), then I can just define one proposition, $$FooQueen$$. If, however, I want to say something like “foo is true for at least one of the queens,” the I would need each of FooQueen1 and FooQueen2.

Every time I want to assert some property of these two queens, I will have to create two propositions to express them. Furthermore, if I want to say something we’ll call about a relationship bar between a queen and each of its drones, I will have create variables BarQueen1Drone1, BarQueen1Drone2, …, BarQueen2Drone4. Any relationship or property that isn’t described in this way cannot be reasoned about.

Anyone who has experience with first order logic might point out that there is a much more elegant solution to propositional logic — we could always just describe these properties as predicates over a universe. Instead of defining two propositions FooQueen1 and FooQueen2, we could say $$\forall x \in \lbrace computers \rbrace: Queen(x) \rightarrow Foo(x)$$, which states that if one of my basement computers is a queen, then foo holds. This expression gives us the possibility of adding another queen to the system without having to explicitly add more propositions.

So far so good, right? Well, there’s a another problem with using SAT solvers for distributed systems — timing information matters. Some processes are redundant, and we need a way to describe concurrency. We could try to define predicates like Before and Concurrently, but again our model explodes in the number of “things” we need to specify. Tools like Alloy have no notion of timing information — we have to tell it what temporal ordering means.

The author found that Alloy simply was not expressive enough to describe the kinds of concurrency invariants he needed. He instead started using TLA+, a modelling language based on temporal logic, with logical “actions” (I think I learned about this in a class I took on logic in NLP, but I don’t recall much about it and the Wikipedia page is empty). This language was expressive enough to describe the systems he wanted to model. Others at Amazon independently came to the same conclusion. Some found a language on top of TLA+, called PlusCal more helpful.

### Prejudice

There seems to be a lot of suspicion (if not prejudice) in the “real world” of formal methods. The authors decided to market their approaches as “debugging designs,” rather than “proving correctness.” This part was the real hook for me. I’ll admit, formally proving things can be quite satisfying. Proving correctness or invariance can feel worthy in their own right. Unfortunately, not all problems may be worthy of this effort.

A major problem in the adoption of these tools is that they aren’t marketed to the right people in the right way. Pitching formal methods as a way to “debug designs” has a lot of appeal to pragmatists — it emphasizes the idea that designs themselves may be flawed, and these flaws may be difficult to find. The authors cite medical devices and avionics as fields that value correctness, because the cost of bugs is very high. However, in most software systems, the cost is much lower. A bug that hasn’t been exposed in testing is probably quite infrequent. Its infrequency means that the cost of paying someone to ensure a program’s correctness outweighs the cost of the consumer’s exposure to the bug. What the authors realized, and what they pitched in this paper, was the idea that an event that only occurs 1 out of a billion executions is no longer a rare event in real software systems. There is a mismatch between the scale at which we test and develop and the scale at which we run production code. We may not be able to extrapolate that a program is correct or a system robust to failure on the basis of testing that are orders of magnitude below actual load.

### How this relates to me

Our pitch for SurveyMan and related tools is similar to the authors’ pitch for using formal methods for large systems. We’re starting from the assumption that the design of the survey (or experiment) may be flawed in some way. Rather than deploying costly studies, only to find out months or even years later that there were critical flaws, use formal methods instead. Our tools don’t look like formal methods, but they are. We construct a model for the survey on the basis of its structure and a few invariants. We use this data to statically analyze and help debug the results of a survey. The author’s model checking and our random respondent simulations are both used to check assumptions about the behavior of the system.

### Footnotes

1. Technically, the “proper” terminology here is master/slave, but that has an inhumane, insensitive ick factor to it, so I’ve decided to start calling it a queen/drone architecture instead.