# Wellformedness = probability?

There are some old arguments against probabilistic models as models of language, but these do not seem to have much force anymore, especially because we now have models that can compute probabilities over the same representations that we use in generative linguistics. Some pointers: Andries Coetzee and I have an overview of probabilistic models of phonology in our Handbook chapter; Mark Johnson has a nice explanation of the development of MaxEnt models and how they differ from PCFGs, as well as other useful material on probabilistic models as models of language learning; and Steve Abney has a provocative and useful piece about how the goals of statistical computational linguistics can be seen as the goals of generative linguistics. See more broadly the recent debate between Chomsky and Peter Norvig on probabilistic approaches to AI, and also the Probabilistic Linguistics book and Charles Yang’s review.

That’s not to say that there can’t be issues in formalizing probabilistic models of language. In a paper to appear in Phonology (available here) Robert Daland discusses issues that can arise in defining a probability distribution over the infinite set of possible words, in particular with Hayes and Wilson’s (2008) MaxEnt phonotactic grammar model. In the general case, for this to succeed, the probability of strings must decrease sharply enough as length increases that the sum over all strings never exceeds 1 and instead converges to it. Daland defines the conditions under which this will obtain in the Hayes and Wilson model in terms of the requirements on the weight of a *Struc constraint that assigns a penalty that increases as string length increases.
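To make the convergence condition concrete, here is a minimal Python sketch under a deliberately simplified assumption (mine, not necessarily Daland's exact setup): the only active constraint is `*Struc`, it assigns one violation per segment, and so every string of length n gets the unnormalized score exp(-w·n). The function names and the inventory size of 20 are hypothetical. With k segments there are k^n strings of length n, so the total mass is a geometric series in k·exp(-w), which is finite only when w > ln k.

```python
import math

def total_mass(weight, alphabet_size, max_len):
    """Unnormalized mass of all strings of length 0..max_len.

    Toy assumption: each segment costs `weight`, so each string of
    length n scores exp(-weight * n), and there are alphabet_size**n of them.
    """
    ratio = alphabet_size * math.exp(-weight)  # mass at length n+1 relative to length n
    return sum(ratio ** n for n in range(max_len + 1))

def converges(weight, alphabet_size):
    """The geometric series is finite exactly when weight > ln(alphabet_size)."""
    return weight > math.log(alphabet_size)

# Hypothetical inventory of 20 segments: ln(20) is roughly 3.0
print(converges(4.0, 20), total_mass(4.0, 20, 200))  # stays bounded
print(converges(2.0, 20), total_mass(2.0, 20, 50))   # keeps growing with max_len
```

In this toy setting, a weight below the threshold means longer strings collectively carry more mass than shorter ones, so no normalizing constant exists.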

In the question period after Robert’s presentation of this material at the GLOW computational phonology workshop in Paris in April, Jeff Heinz raised an objection against the general notion of formalizing well-formedness in terms of probabilities, and he repeated this argument at the Manchester fringe workshop last week. Here’s my reconstruction of it (hopefully Jeff will correct me if I get it wrong – I also don’t have the references to the earlier work that made this argument). Take a (relatively) ill-formed short string. Give it some probability. Now take a (relatively) well-formed string. Give it some probability. Now concatenate the well-formed string enough times until the whole thing has probability lower than the ill-formed string, which it eventually will.
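The arithmetic behind this reconstruction can be sketched as follows. The sketch assumes, purely for illustration, that concatenating k copies of a string multiplies their probabilities (an independence assumption that real phonotactic models need not satisfy); the function name and the example probabilities are hypothetical.

```python
import math

def repetitions_needed(p_good, p_bad):
    """Smallest k such that p_good**k < p_bad.

    Assumes (for illustration only) that the probability of k concatenated
    copies is the product of the individual probabilities. Works in log
    space to avoid underflow for large k.
    """
    if not (0.0 < p_good < 1.0 and 0.0 < p_bad < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    log_good, log_bad = math.log(p_good), math.log(p_bad)
    k = 1
    while k * log_good >= log_bad:
        k += 1
    return k

# Hypothetical values: a well-formed string at 0.01, an ill-formed one at 1e-7
print(repetitions_needed(0.01, 1e-7))  # 4
```

However small the ill-formed string's probability, the loop terminates, which is the "eventually will" in the argument.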

This is meant to be a paradox for the view that we can formalize well-formedness in terms of probabilities: the long well-formed string has probability lower than the short ill-formed string. It’s not clear to me, however, that there is a problem (and it wasn’t clear to Robert Daland either – the question period discussion lasted well into lunch, with Ewan Dunbar taking up Jeff’s position at our end of the table). Notice that Jeff’s argument is making an empirical claim that the concatenation of the well-formed strings does not result in a well-formedness decrease. When I talked to him last week, he claimed that this is clearer in syntax than phonology. Robert’s position (which I agree with) is that it likely does – though from his review of the literature on phonotactic well-formedness judgments we don’t seem to have empirical data on this point.

Robert asked us to work with him in designing the experiment, and at the time I wasn’t sure that this was the best use of our lunch time, but I think he has a point. If this is in fact an empirical issue, and we can agree beforehand on how to test it, then this would save a lot of time compared with the usual process of the advocates of one position designing an experiment, which even if it turns out the way they hope, can then be criticized by the advocates of the other position as not having operationalized their claim properly, and so on…

It’s also of course possible that this is not an empirical issue: that there is a concept of perfect well-formedness that probabilistic models cannot capture. This reminds me of a comment on a talk I got once from a prominent syntactician when I discussed probabilistic models that can give probability vanishingly close to zero to ill-formed structures: “but there are sentences that I judge as completely out for English – they should have probability zero”. My response was to simply repeat the phrase vanishingly close to zero, and check to make sure he knew what I meant.

## 28 thoughts on “Wellformedness = probability?”

1. Ewan

Well, it’s still an empirical issue if there’s a concept of well-formedness that probabilistic models can’t capture. (That’s what the experiment is supposed to be testing.)

It sounds like maybe you’re suggesting that an epicycle theory with a non-probabilistic grammar G and probabilities over it P could propose that behavioral measures are inevitably a combination of G and P, making G hard to get at by any measure. That’s a different situation again. It would mean that there’s a notion of well-formedness which is separate from probabilities *and* which is hard to get at empirically. It would mean there are two competing theories, G + P, and unitary Pg. The two would be intertranslatable as far as predicting judgments is concerned.

Still, hard to get at is not the same as impossible. Whether the difference between the two would be empirical would depend on what you’re asking the theory to be responsible for. For example, if you had an independent theory of P, in which it wasn’t just about words. The scope of G and P would be different and they’d have to play nice together. Or if, for example, the moving parts of G had to also explain acquisition in a very particular way.

Luckily the issue here isn’t that complex. Here it’s just whether *any* probabilistic theory is plausible at all. If we can’t get the likelihood to be finite, we have a problem. (If it doesn’t work we could build an epicycle theory in the other direction Pg + N, where the basic grammar is probabilistic and some other module N is responsible for their unreasonable non-normalizability. Frankly, I wouldn’t be too displeased in any case if it got us some empirical work done. Some of the most precise astronomy was done under pretty wrong theories, but the fine empirical details were what allowed us to go on and discover better theories.)

The issue is fundamental and sounds obvious, which is why I think we need to test it. It’s one of these very basic things that always seemed so obviously true to me that I didn’t think we *needed* to test it. The more I think about it, as linguists I think we do that a lot, and I’m not sure that’s been good for the field.

1. Joe Pater Post author

Right – it’ll be an empirical issue as long as “we” (whoever the relevant we is) can agree on how to operationalize well-formedness. My anecdote was meant to hint at a worry that there may be a conceptual version of “well-formed” that cannot be operationalized and is fundamentally incompatible with a probabilistic interpretation.

Thanks for the thoughts – very interesting.

2. Robert Daland

Indeed, thanks Joe for a cogent summary.

I will be teaching Experimental Phonology next year, and a very attractive option is to do the required experiment. I would quite like to have Jeff’s feedback beforehand, as it would be great to know that we all could agree beforehand on how to interpret the results.

To start the ball rolling, we can observe that there are all kinds of extra-grammatical factors that condition listeners’ acceptability judgments, including especially task factors. But hopefully we can all agree to abstract away from that if the task is controlled? I think most of us would agree that well-formedness will generally decrease with the length of a grammatical sequence — but this claim is perfectly consistent with the possibility that many very long sequences can be perfectly grammatical. If I recall correctly, one person (perhaps Jeff) suggested that “perceived well-formedness” is actually normalized against length. This would mean something like, how good is this sequence of n elements GIVEN THAT it has n elements? The alternative hypothesis that is most directly supported by the limited data I am aware of is that perceived well-formedness is proportional to the log-probability assigned by an appropriately sensitive statistical model. That is my understanding of the two major contenders in the ring. (Note that both versions have trouble with the “mrupation”-like items from Coleman & Pierrehumbert, 1997; and Daland et al. 2011 *and* Albright, 2011? found that [mr] onsets were rated better by English listeners than they should be…)

Perhaps Jeff could weigh in and give a thumbs up or a correction for my statement of the position above. And as for Ewan, I have to confess I don’t quite understand any of the models you are proposing, so I don’t have any sense for what kind of contrasting predictions they do or might make.

1. Ewan

Right, well, I didn’t really propose a model. I just said that either you think the probabilities are assigned by the grammar or you don’t. And, if you are going to claim that gradient judgments are attributable to something other than the grammar then, in order for that claim to be empirically distinct, you should say what that “thing” is in a way that’s dissociable from the grammar.

But I think we need to think about that *only* in the case that the behavioral measure comes back getting worse in the length of the word.

To me the first pass version sounds like a low-cost-risky-in-only-one-direction experiment. Take the most naive measure possible of well-formedness, just ask for ratings. Control the stimuli a little bit – optimize for high phonotactic well-formedness and low morphological decomposability – and let them go out to being painfully long to be sure. And then see what happens. If the tail is relatively flat, then, prima facie, Jeff’s right.

It’s only if not that we would need to think about all the competing models for gradience.

3. Jeff Heinz

First I would like to make clear what I think the issue is, and then I will respond to some of the comments.

The facts are these. If you have

1) a probability distribution over Sigma* such that
2) there are infinitely many forms with nonzero probability and
3) there is at least one ill-formed word with nonzero probability (say it is epsilon)

then there are infinitely many forms with probability less than epsilon.
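The counting step behind these facts is that at most 1/epsilon forms can have probability at or above epsilon, since otherwise the total would exceed 1. A minimal sketch, using a toy distribution of my own choosing (stop with probability `stop` at each step, otherwise emit a uniformly chosen symbol; the names and parameter values are hypothetical):

```python
from itertools import product

def length_prob(n, alphabet_size=2, stop=0.5):
    """Probability of any single string of length n under the toy model."""
    return stop * ((1 - stop) / alphabet_size) ** n

def forms_at_or_above(eps, alphabet="ab", stop=0.5):
    """Enumerate every string whose probability is at least eps."""
    found = []
    n = 0
    # Probability only depends on length and shrinks with it, so we can
    # stop at the first length that falls below eps.
    while length_prob(n, len(alphabet), stop) >= eps:
        found.extend("".join(s) for s in product(alphabet, repeat=n))
        n += 1
    return found

eps = 0.001
above = forms_at_or_above(eps)
print(len(above), "forms at or above eps; bound is", int(1 / eps))
```

Every other string over the alphabet, infinitely many of them, falls below eps, whatever value of eps is chosen.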

Why is this a problem?

From my perspective, we want to say there are infinitely many well-formed linguistic structures (words, sentences, etc). (I refrain from explaining here WHY we want this, but I can explain if you ask.) But if probability equals well-formedness then it follows from the above that there are only finitely many well-formed structures (since only finitely many forms will have probability greater than epsilon).

I do think the issue is easier to see in syntax than in phonology, but I believe the arguments go through in phonology as well. The sentence ‘Bird the flew’ is not well-formed. Give it some nonzero probability, epsilon. Now for any epsilon you give me, I can give you a perfectly well-formed English sentence which would have to have probability less than epsilon. The sentence would be very long, but still structurally pretty simple. Something like “John laughed and laughed and laughed…” or “John laughed and Mary sang and Joe danced and Jane went to the park in a scooter that was made of plastic imported from Latin America and Sally saw the man that …”

I don’t think people judge the long sentence to be worse just because it can be quite long. I suspect people are examining local parts of it and judging each of those as grammatical.

For phonotactics, the argument is like this. A word like “bzaSrk” has some non-zero probability, say epsilon. For any epsilon you give me, I can give you a long word which will have to have a probability lower than epsilon. Perhaps something like this: “kapalatSakolapinipisaukimalgalanu”. I think I can make even longer ones if needed. Again, I think people are probably judging local parts of the word, not the whole thing.

Robert claims this is an empirical issue, but I am not so sure. I wonder if it is conceptual. I agree with Robert that there are many “extra-grammatical factors that condition listeners’ acceptability judgments”. I don’t agree with the statement that “most of us would agree that *well-formedness* will generally decrease with the length of a grammatical sequence [emphasis added]”. I agree that IF *acceptability* judgements of phonotactic strings were shown to decrease as length increased I would not be too surprised, since extra-grammatical factors are plausibly responsible. (And I think Robert may have meant “acceptability” and not “well-formedness” here.)

For me, acceptability of a word, well-formedness of a word, and likelihood of a word are logically distinct concepts. Hypotheses that equate them are powerful indeed. Much ink has been spilled on the distinction between acceptability and grammaticality (Schütze’s book for instance). (I suppose I am equating grammaticality with well-formedness.) The hypothesis that likelihood, or probability, is the same as grammaticality is a BIG hypothesis, but faces the obstacle that I raised at the beginning. We want the generalizations (or extensions thereof) to describe these infinite sets and we want to assign low nonzero probabilities to things we deem bad in some way. But then you will have infinitely many things below these low probabilities and only finitely many above them.

Robert suggests one way out, which is to normalize by length, something I may have mentioned (I remember talking to Vijay Shankar about this a few years ago). That’s certainly one way, but it makes it hard to compare structures of different sizes. More generally the idea is to transform the probability in some way into an acceptability or grammaticality judgment. Clark and Lappin have looked at this problem in the context of syntax both in their 2011 book and also in a 2013 CMCL paper (they also have a paper that year at the cogsci conference). They state the nature of the problem quite well: “Oversimplifying somewhat, we are trying to reconstruct a gradient notion of grammaticality which is derived from probabilistic models, that can serve as a core component of a full model of acceptability.” Despite their efforts in this paper, there is still more work to be done.

I am glad they (and we) are thinking about it though. It’s a tough problem and one worth working on, probably from empirical, conceptual, and theoretical approaches.

4. William Idsardi

Here’s Geoff Pullum making the same point as Jeff (and attributing the observation to Gerald Gazdar), starting at the bottom of page 6 (substitute “iteration” for “recursion” in the first sentence of the paragraph if you like).

http://www.lel.ed.ac.uk/~gpullum/postMoLpaper.pdf

The relevant quote is on page 7:

“In other words, in an infinite probabilistic language, probabilities for grammatical expressions can fall arbitrarily low, so no cutoff could ever be low enough to function as a surrogate for the grammaticality / ungrammaticality threshold and separate off the expressions that are definitely illformed but have a non-zero likelihood of occurring.”

5. Mark Johnson

This is an interesting question, and I think I basically agree with Jeff’s comments above, and in addition I think that grammaticality / well-formedness probably can’t be defined by a simple probability threshold. I’m not sure what the right way to understand grammaticality is: I think it depends on your theory of grammar. If the Harmony / OT story is roughly right, then a surface form is grammatical if it’s the optimal output for some underlying form. As I’m sure you all know, you can define a stochastic version of Harmony theory as a two-stage generative process: some process chooses underlying forms, and then a MaxEnt-type conditional distribution maps those to surface forms. Under this kind of model, the grammatical outputs (i.e., those maximising the conditional distribution) aren’t necessarily those with highest marginal probability.

I’d also like to add that for a language user, actually determining the grammaticality of an utterance is not really important for either comprehension or production. E.g., in comprehension, the comprehender is trying to determine the most likely meaning the speaker was trying to convey, i.e., solve an optimisation problem rather than check whether certain constraints are satisfied.

6. Thomas Graf

I don’t quite understand why you’d think that this issue could be decided by an experiment. An experiment always tests the conjunction of the grammar and the cognitive system it is embedded in, but the claim under discussion here is just about the grammar. At the end of the day, it boils down to the competence/performance distinction and why this is a good distinction to make.

First, Jeff is right that the issue surfaces more clearly in syntax. Take any unboundedly productive construction like center embedding. With every level of embedding, the probability of the sentence decreases, until it eventually drops below whatever threshold you fixed as demarcation between grammatical and ungrammatical. A tempting way around that is to identify well-formedness not with the actual probability of the string but rather probability normalized with respect to sentence length. But that doesn’t work because then you can pad out ill-formed sentences until they become well-formed, e.g. by throwing in adjuncts, of which you can always have an unbounded number per clause. So you have to do something more sophisticated like normalize with respect to the level of center embeddings and adjuncts. Congratulations: you’ve just smuggled the good old binary well-formed/ill-formed system into your probabilistic setting by identifying a class of constructions that preserve well-formedness under unbounded iteration. At that point you might just as well drop probabilities altogether.
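The padding problem with length normalization can be illustrated with a small sketch. The per-token log-probabilities below are invented for illustration, as are the function name and the cutoff of -2.0; the only point is the direction of the comparison.

```python
def mean_logprob(token_logprobs):
    """Length-normalized score: average log-probability per token."""
    return sum(token_logprobs) / len(token_logprobs)

# Hypothetical three-token sentence whose middle token is a severe violation
bad = [-2.0, -12.0, -2.0]

# The same sentence padded with 20 hypothetical high-probability adjunct tokens
padded = bad + [-1.0] * 20

THRESHOLD = -2.0  # hypothetical per-token cutoff for "well-formed"

for name, seq in [("bad", bad), ("padded", padded)]:
    score = mean_logprob(seq)
    print(name, "total:", sum(seq), "normalized:", round(score, 3),
          "passes:", score >= THRESHOLD)
```

The padded sentence is less probable overall but scores better after normalization, so a per-token threshold would admit it even though the violation is still there.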

But dropping probabilities means that you have a hard time accounting for anything that involves gradience, which is a much more commonly studied topic in phonology than syntax, so I can understand why you would be disinclined to go that route. Two points:

1) Gradience does not imply the use of probabilities. Probabilistic formalisms are just a special case of weighted formalisms, and there’s many weighted systems where the final value of a sentence does not linearly decrease with its length. Take for example logics with multiple truth values, which have been proposed to model presupposition failures in semantics. Clearly the length of a sentence has no effect on whether it’s true, false, or a presupposition failure.

2) The competence/performance distinction is useful irrespective of whether it is ontologically true. There’s many arguments in its favor, but the most important one is that without this distinction you necessarily end up with finite cut-off points everywhere (only so many levels of embedding, long-distance dependencies can only be applied so far before speakers forget about them), and those hide important generalizations. A finite-state grammar for center-embedding is a giant monstrosity that completely fails to capture basic facts, e.g. that word order does not vary with levels of embedding (an automaton where word order at every other level of embedding is OVS instead of SVO would be just as complex as one that has the same word order everywhere). A CFG provides a much smaller description and explains why all levels behave the same, but it also claims that grammaticality does not taper off, even though acceptability might.

The moral is that you want to abstract away from real-world behavior in at least some cases, which requires that you have a theory that still has something to say at this level of abstraction. That’s not the case if all your work is done by probabilities.

7. Mark Johnson

I’m at a conference writing this late at night, so I haven’t read everything carefully, but the post and all the comments I’ve read seem reasonable.

Myself, I’d be surprised if there were a simple probability threshold that characterises grammaticality. I suspect that if something like a Harmony Theory / OT account is correct, the relationship would be more complex. Specifically, a surface form S is grammatical if it is the optimal surface form for some underlying form U. If the probability of underlying forms U can vary wildly, then the probability of optimal surface forms should also vary wildly, too.

The kind of generative model I have in mind assumes that “optimal” = “most probable”, so the generative model looks something like:

P(S,U|C) = P(S|U) P(U|C)

where C contains the linguistic and non-linguistic information determining the underlying form U.

Under this formulation, S is grammatical iff there exists a U’ and a C’ such that:

S = argmax_S’ P(S’|U’) P(U’|C’)

The main weakness I see with this account is that it basically assumes there is no ineffability; each underlying form U can be expressed by some grammatical surface form S. But I think this is a weakness of all optimisation-based approaches, such as Harmony Theory and Optimality Theory.
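A toy instance of this two-stage model shows why no single probability threshold picks out the grammatical forms. All names and numbers below are hypothetical, and the sketch fixes a single context C, so P(U|C) is just a table over underlying forms.

```python
# P(U | C) for one fixed context C: one common and one rare underlying form
p_u = {"U1": 0.99, "U2": 0.01}

# P(S | U): conditional distribution over surface forms for each underlying form
p_s_given_u = {
    "U1": {"S1a": 0.6, "S1b": 0.4},
    "U2": {"S2a": 0.9, "S2b": 0.1},
}

def grammatical_forms(p_s_given_u):
    """S is grammatical iff it maximizes P(S | U) for some U.

    P(U | C) does not depend on S, so it drops out of the argmax over S.
    """
    return {max(dist, key=dist.get) for dist in p_s_given_u.values()}

def marginal(p_u, p_s_given_u):
    """P(S | C) = sum over U of P(S | U) * P(U | C)."""
    totals = {}
    for u, dist in p_s_given_u.items():
        for s, p in dist.items():
            totals[s] = totals.get(s, 0.0) + p_u[u] * p
    return totals

print(grammatical_forms(p_s_given_u))
print(marginal(p_u, p_s_given_u))
```

Here the non-optimal form `S1b` has marginal probability 0.396 while the optimal form `S2a` has only 0.009, so any cutoff that admits `S2a` also admits `S1b`.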

1. Joe Pater Post author

Thanks for your thoughts Mark (and everyone else!). Lots to mull over…

There is actually one fairly natural way of getting categorical grammaticality and gradient acceptability out of an HG system (I recall recently seeing something of Jon Sprouse’s on the need for both). You can make your choices for a given input as in standard OT, picking the best output; this is grammaticality. Then you say acceptability is derived from the numerical Harmony scores across (input, output) pairs. This is essentially what Andries Coetzee and I did in our paper on gradient phonotactics:

I think this is how Keller’s Linear OT works – the paper has references to that, Boersma’s critique, and a possible solution.

I think Hayes and Wilson’s model is a better model of phonotactics though – see Wilson and Obdeyn’s paper for a comparison (this is a version I posted a long time ago for a discussion and it’s the only one I found googling it – CW let me know if I should take it down).

http://people.umass.edu/hgr/WilsonObdeynSeptember2009.pdf

Here’s a squib I wrote in 2007 but never circulated that gives references on ineffability in HG / OT and makes a proposal that may be related to some of our discussion:

http://blogs.umass.edu/comphon/files/2015/06/pater-recover.pdf

The Hayes and Wilson (2008) model is defining a distribution over possible words (i.e. Input = WORD as Robert points out in his upcoming Phonology article), so I don’t think the issue arises there…

8. Joe Pater Post author

Here are some further thoughts due to an off-line conversation with Steve Abney.

First, it may be worth pointing out that the Hayes-Wilson model that Daland is working with defines a probability distribution over word-forms in a corpus, but does not actually model the probabilities of particular lexical items (e.g. that ‘the’ is frequent). Therefore, the notion of well-formed but low probability sequences which comes up in discussion of syntactic parsing simply can’t arise there.

Second, while we have phonologists and syntacticians gathered in the same place, I’m curious to find out whether work along the lines that Robert and many others are pursuing in phonology is being done in syntax. This line of work starts with Ernestus and Baayen (2003: Language), along with Adam Albright and Kie Zuraw in their dissertations and Bruce Hayes in collaboration with them and others (it also has connectionist antecedents in work on the past-tense and generative precursors in Dresher and Kaye and Tesar and Smolensky, etc.). The goal is to train a learner on a corpus and then see if the judgments it produces for nonce words or wug tests match experimental data. I recall a talk at the “Were we wrong all along?” conference organized by Berwick in which the speaker showed that models trained with NLP goals with NLP features on NLP corpora did not produce judgments that looked anything like those of humans. But has anyone been training models with the types of representations we use in generative linguistics to try to match judgments? I’m familiar with the generative syntax learning work summarized and extended in Charles Yang’s book, and if I recall correctly, Steve Abney’s article linked in my original post is arguing for this sort of approach (I didn’t reread it when posting). But references to later work would be very much appreciated.

1. Joe Pater Post author

I was particularly interested in syntax, but I hadn’t seen Colavin’s dissertation yet, so thanks for that! Clearly, we need to be doing a better job of sharing our work – as you mentioned off-line, a general phonology archive, or maybe even just an index with links (since many papers are already electronically archived), would probably be a good idea at this point.

9. Steve Abney

To digest the off-line discussion I had with Joe:

First, the original question is posed in terms that are an invitation to confusion. The question is really whether training a language model gives a good model of well-formedness. A good model of either sort (language model or well-formedness model) is likely to be probabilistic.

A language model is constrained to sum to one over all items in the space (whether the items are sentences or word candidates), whereas a well-formedness model is under no such constraint. The natural probability in a well-formedness model is the probability that a human says “yes” if asked to judge the well-formedness of an item. Sum that over all items, and you get infinity, not one. *Obviously*, well-formedness models and language models are not the same thing.

A more refined question is whether normalizing a well-formedness model might give you a language model. The question the discussion has been focussing on is whether the normalization is even possible – it’s not possible if the sum does not converge, and the sum does not converge if there are infinitely many items with non-vanishing well-formedness. But there are other reasons to be skeptical.

1. Consider “the” versus “teepee”. Surely, “teepee” is phonotactically simpler (more well-formed) than “the”, but obviously “the” has a higher probability in the language model. Any inversion like that tells you that the language model is not just a normalized version of the well-formedness model.

2. The accepted wisdom in computational linguistics, based on many past failures, is that if you do grammatical inference by training a language model, you do not get grammars that correspond to human structural judgements.

If you want a good grammar, you need a good model of grammaticality, and the signal from language modeling is just too indirectly related to structure to be very useful. The current best grammatical inference methods involve training a language-independent parser from one or more “training” treebanks – think of the language-independent parser as a very practical representation of UG – and then use it to bootstrap language-specific parsers in new languages.

3. It would be plausible to use well-formedness probabilities as a filter on an underlying item-generation process, but you won’t end up with a language model that is *proportional* to the well-formedness model. What plausible stochastic process would even give rise to proportionality between the models?

1. Ewan

Are you using the term “well-formedness” to mean “acceptability” in the context of a judgment? I think Joe was using it to mean “degree of grammaticality assigned by the competence grammar,” which is rather different, and is at the crux of this discussion.

1. Joe Pater Post author

Phonologists, including me, often conflate well-formedness and acceptability. For me, this can lead to some interesting conversations with students who work with syntactic psycholinguists (e.g. Claire Moore-Cantwell and Amanda Rysling who also work with Brian Dillon and Lyn Frazier). I think there are two reasons I conflate them: (1) the processing burden in phonological judgments and wug-tests is clearly of a different order than in syntactic judgments, and likely easier to abstract from in modeling; (2) where we do have two kinds of potential explanation (grammar and analogy), the models are often hard to distinguish once grammars are made probabilistic and analogical models are given richer representations (though see Daland et al. recently in Phonology and Albright and Hayes 2003).

2. Steve Abney

I’m taking well-formedness to be the same as grammaticality, and I’m assuming the interesting case is the gradient case. I’m assuming that degrees of grammaticality as assigned by the grammar are predictions of grammaticality/acceptability judgements made by a human. If one wants to factor one’s grammar into a “competence” portion and an “other factors” portion, that’s fine, but unless one is fully explicit about both pieces, we have no predictions to talk about.

10. Robert Daland

Hi all,
I’d like to weigh in again.

There have been several arguments in this thread which could be interpreted in the line of, “Don’t bother doing an experiment, because there are so many what-ifs, it’s not even clear what it would mean.” Certainly that has been one take-away that I got from Jeff, Ewan, and Thomas’ comments (though I don’t say that’s the one they *meant* for me to take away…).

Of course, there is a legitimate question of what we are modeling. For me at least, the ultimate goal is to provide a decent model of human behavior (that is to say, *observable* behaviors such as well-formedness judgments, and performance on speech segmentation and lexical decision and other tasks thought to tap linguistic knowledge). While human well-formedness judgments are decidedly more meta-linguistic than what I am really interested in, I *think* I am on safe ground in asserting that most of us involved in this discussion feel that well-formedness judgments are connected somehow to the grammar. Therefore, studying well-formedness judgments is useful to the extent that they shed light on the grammar. The crucial issue that Jeff raised, at least as I interpret it, is how well-formedness judgments are connected to the grammar.

I think it is uncontroversial that we cannot be worse off in trying to answer this question if we have more data about well-formedness judgments. Of course, theory and data are both important components of science, and conceptual unclarity is never resolved by more data alone. But a persistent theme in the discussion so far has been a lack of data that even bears on this question. Collecting more data is the obvious remedy. Right now, we do not even know the extent to which human judgments decline with increasing length. If we do not see such a decline, then perhaps we could say that Jeff has already won. So it seems to me that this data does have the potential to be informative, even if some outcomes are not 100% interpretable. Put another way, surely we will be in a better position to interpret that data once we know what the relevant data is.

I’d therefore like to offer a challenge. Let us suppose that Daland’s team will do *some* kind of experiment on length-related effects on well-formedness judgments over the winter quarter next year. What are the design properties of this experiment that will convince you the data is at least worthy of thought?

I will begin with a shoot-from-the-hip design. The experiment will consist of an equal number of words at each of 5 lengths; for concreteness let us say 30 monosyllables, 30 disyllables, 30 trisyllables, 30 4-syllable forms, and 30 5-syllable forms. The items will be systematically manipulated for well-formedness, with 1/3 containing no obviously ill-formed sub-part, 1/3 containing one serious phonotactic violation, and the remaining 1/3 containing two serious phonotactic violations. As in Daland et al. (2011) and Kawahara & Coetzee (2011), two different well-formedness tasks will be conducted: an absolute rating task (hear a form, rate it on a Likert scale) and a relative rating task (hear two forms, pick which sounds better). Of course the experimental judgments will be compared against the predictions of state-of-the-art phonotactic models. Note that this design is quite similar to Daland et al. (2011) and Albright (2011?), except that it is specifically word length that is being manipulated rather than SSP or onset well-formedness.
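For concreteness, the stimulus grid described above can be laid out as a small script. This is only a bookkeeping sketch: the field names are hypothetical, and it assumes the thirds are balanced within each length (10 items per cell), which the description suggests but does not state outright.

```python
from collections import Counter
from itertools import product

SYLLABLE_COUNTS = [1, 2, 3, 4, 5]   # five word lengths
VIOLATION_COUNTS = [0, 1, 2]        # serious phonotactic violations per item
ITEMS_PER_LENGTH = 30

def build_design():
    """Return one record per stimulus slot in the length x violation grid."""
    per_cell = ITEMS_PER_LENGTH // len(VIOLATION_COUNTS)  # 10 items per cell
    return [
        {"syllables": n, "violations": v, "item": i}
        for n, v in product(SYLLABLE_COUNTS, VIOLATION_COUNTS)
        for i in range(per_cell)
    ]

design = build_design()
cells = Counter((row["syllables"], row["violations"]) for row in design)
print(len(design), "items in", len(cells), "cells")
```

Each slot would still need an actual nonce form, and both rating tasks would draw on the same item list.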

1. Gaja Jarosz

Robert: I, for one, would be interested to see what you find in your proposed experiment. I suppose if you do find a decline in acceptability with length the question of whether wellformedness or something else is responsible will remain, but I think we will know more than before. For your experiment to usefully engage with Jeff’s critique, your stimuli would need to manipulate the properties that differentiate the wellformedness=probability hypothesis and the alternative(s). So I would be interested to hear exactly what the proposed alternatives are. Unpacking Jeff’s comment: “I suspect people are examining local parts of it and judging each of those as grammatical”, I wonder what exactly the right cases would be to test this claim. Presumably whatever violations the long forms have multiples of would have to be mild enough to still count as grammatical. So I wonder whether Jeff would say the crucial comparison is between a short perfectly well-formed string and a long perfectly well-formed string (for which probability=wellformedness would have to predict a difference but for which ‘local grammaticality’ may not) or whether it’s between a mildly offensive short string and an equally mildly offensive longer string. For that to be the right comparison, should the longer string have more of those mildly offensive violations or the same number? Jeff?

2. Thomas Graf

I have to say I find this discussion a lot more confusing than I anticipated, so this comment will be rather rambly.

Some of the assumptions that are apparently commonplace in phonology are pretty alien to me. In particular the conflation of well-formedness and acceptability strikes me as just as misguided in phonology as it is in syntax, and it is for the reason I already presented above: irrespective of whether the two are cognitively identical, methodologically there are very good reasons to keep them separate. It doesn’t matter whether you treat them as ontologically distinct or the former as an abstraction of the latter (e.g. in the sense of Marr’s levels of description), the fact remains that acceptability obscures properties that emerge much more clearly with well-formedness.

Another difference is the focus on behavior. In mainstream generative syntax that’s mostly considered a naughty word, but there are a few generative syntacticians like Joan Bresnan who have undergone a similar shift towards modeling behavior in recent years. I’m fairly agnostic on this issue, once again for a formal reason: mathematically it is fairly easy to take a competence/well-formedness model and turn it into an acceptability/performance model, it’s just a matter of switching out the evaluation domain (e.g. {0,1} for well-formedness and [0,1] for acceptability). Everything else in the formalism stays the same. Empirically it’s of course anything but trivial to find the right acceptability model — there’s not even any solid evidence that probabilities offer the right granularity — but conceptually the problem is clear-cut.

Going in the other direction, however, is not easy. Turning an acceptability model into a well-formedness model is difficult because acceptability models rely on probabilities to do a lot of their work, so if you rip that out what you’re left with might be complete garbage. The problem would be less pronounced if there were a direct map from [0,1] back to {0,1} that replicated well-formedness judgments. Jeff’s argument, at least the way I understand it, is essentially that there is no direct map because probabilities decrease with length whereas well-formedness does not. That is the crucial point of the argument: well-formedness does not decrease with length or, more generally, the complexity of the structure because the whole point of well-formedness is to abstract away from these factors to bring out specific properties.

If you want to have a map, you have to construct it in such a way that it is aware of all the peculiarities of the well-formedness model, which of course presupposes a well-formedness model, at which point you could have just as well started with the well-formedness model and layered an acceptability model on top of it.
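To make the asymmetry concrete, here is a minimal sketch under assumptions that are not in the comment itself: a toy bigram grammar over the hypothetical alphabet {a, b}, with invented weights. The same evaluation routine is run once over {0,1} (combining with logical AND) and once over [0,1] (combining with multiplication), so only the evaluation domain changes.

```python
from functools import reduce

# Hypothetical toy grammar: weights on bigrams over {a, b}; '#' marks word edges.
# Every number here is invented purely for illustration.
BIGRAM_WEIGHTS = {
    ("#", "a"): 0.6, ("#", "b"): 0.4,
    ("a", "a"): 0.1, ("a", "b"): 0.5, ("a", "#"): 0.4,
    ("b", "a"): 0.7, ("b", "b"): 0.0, ("b", "#"): 0.3,
}

def bigrams(word):
    """Return the bigrams of a word padded with edge markers."""
    padded = ["#"] + list(word) + ["#"]
    return list(zip(padded, padded[1:]))

def evaluate(word, weight_of, combine, unit):
    """Combine per-bigram weights in whichever evaluation domain is supplied."""
    return reduce(combine, (weight_of(bg) for bg in bigrams(word)), unit)

def well_formed(word):
    # Domain {0,1}: a bigram is licit iff its weight is non-zero; combine with AND.
    return evaluate(word, lambda bg: BIGRAM_WEIGHTS.get(bg, 0.0) > 0,
                    lambda x, y: x and y, True)

def probability(word):
    # Domain [0,1]: identical structure, but weights are multiplied.
    return evaluate(word, lambda bg: BIGRAM_WEIGHTS.get(bg, 0.0),
                    lambda x, y: x * y, 1.0)

if __name__ == "__main__":
    for k in range(1, 5):
        word = "ab" * k
        print(word, well_formed(word), probability(word))
```

In this toy setting every repetition of "ab" stays well-formed while its probability keeps shrinking, which is the sense in which a single probability threshold cannot recover the {0,1} judgments without knowing about the grammar.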

So to wrap up on a more productive note, here are two criteria that I usually pay attention to when I have to decide whether a process should be assumed to be unbounded at the competence level:

1) Is there a cut-off point that can be defined in purely grammatical terms? For example, I do not believe that scrambling in German is unbounded because i) that would make it a much more complex process than anything else we know, and ii) sentences are instantaneously bad whenever two scrambled DPs have exactly the same feature specification with respect to person, case, gender and animacy. Note that point ii) is very natural from any grammar perspective whereas a processing account would need very specific assumptions (e.g. that feature-identical DPs are stored in the same “memory cell” and there’s only one cell for each feature configuration). That’s not to say that the processing account isn’t actually what’s really happening, but in the absence of a well-understood theory of memory it offers no real insight.

2) Does unboundedness explain properties that would not follow otherwise? This is the case with embedding constructions, where the fact that all levels of embedding obey the same constraints follows immediately if there is no upper bound on embeddings.

As you can see those are purely theoretical conditions that have little to do with actual acceptability. I am not sure that there is a good experimental criterion for identifying acceptability factors. For instance, all kinds of embedding are assumed to be unbounded, yet left and right embedding are easier than embeddings with crossing dependencies which in turn are easier than center embeddings. So there is no uniform relation between levels of embedding and decline in acceptability. This kind of indeterminacy is one of the reasons why there’s still no agreement whether certain island effects (in particular those involving weak islands) are due to the parser or the grammar.

11. Gaja Jarosz

Very interesting discussion! In general, I would be curious to try to unpack some properties / assumptions that seem to be conflated in much of the discussion above as it would be valuable to clarify exactly what the various objections are to. Much of the lack of clarity centers around what exactly we mean by probability = wellformedness (P=W).

1) Is P=W narrowly defined as the hypothesis that wellformedness can/should be defined in terms of a probability distribution over surface strings derived from phonotactic/surface considerations? Or does P=W encompass any model of grammaticality that relies on probability theory in principle?

I think Mark’s comments and Joe’s reply identify an important distinction to be made among possible probabilistic models: probability of surface strings could be determined not only by surface phonotactic considerations but also by competition in the UR->SR mapping. [ptak] (Polish for ‘bird’) might not be judged as very acceptable in English, but [pteto] *as a possible output for ‘potato’* might be rated quite highly.

So we could define grammaticality in terms of P(S | U) P(U | C) as noted by Mark (I would want to be more flexible than Mark’s suggestion of taking the argmax, however, since I think there can be multiple grammatical outputs for a single UR, but this is a minor point). If we do so, and we can define a distribution P(C) and then sum over C and U, we get a marginal distribution over S. This is essentially what I pursued in my dissertation. So finally getting to the crux of my question: the issue is that we then have a P(S), which will necessarily have the ‘probability gets smaller as words get longer’ property. However, even though this model is capable of assigning probability to surface strings, the probability distribution over surface strings is not itself the model of grammaticality in this hypothesis. Does the objection to P=W remain for this probabilistic model?
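As a rough illustration of the marginalization just described, the sketch below computes P(S) as the sum over C and U of P(S | U) P(U | C) P(C). The concepts, underlying forms, surface forms and all probabilities are invented for the example and are not taken from any actual model.

```python
from collections import defaultdict

# All names and numbers below are hypothetical, chosen only to show the arithmetic.
P_C = {"POTATO": 0.5, "BIRD": 0.5}                       # P(C): concepts
P_U_GIVEN_C = {                                          # P(U | C): underlying forms
    "POTATO": {"/pəteto/": 1.0},
    "BIRD": {"/bərd/": 1.0},
}
P_S_GIVEN_U = {                                          # P(S | U): surface forms
    "/pəteto/": {"[pəteto]": 0.8, "[pteto]": 0.2},
    "/bərd/": {"[bərd]": 1.0},
}

def marginal_surface_distribution():
    """Sum P(S | U) * P(U | C) * P(C) over all concepts C and underlying forms U."""
    p_s = defaultdict(float)
    for concept, p_c in P_C.items():
        for ur, p_u in P_U_GIVEN_C[concept].items():
            for sr, p_sr in P_S_GIVEN_U[ur].items():
                p_s[sr] += p_sr * p_u * p_c
    return dict(p_s)

if __name__ == "__main__":
    print(marginal_surface_distribution())
```

With these made-up numbers, the conditional probability of [pteto] given its underlying form differs from its marginal probability, which is the distinction the question turns on: the conditional quantity, not the marginal P(S), would be the candidate model of grammaticality.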

2) Does the objection to P=W crucially assume that P assigns non-zero probability to everything? This is a property of e.g. MaxEnt and Stochastic OT, but not a fundamental property of probability.

It seems to me that the problematic prediction that there are only finitely many well-formed strings follows from this assumption and the assumption that there is a strict cut-off between well-formed and ill-formed. Neither of these assumptions is obvious to me. To make this more concrete, what if all absolutely ungrammatical strings had probability zero (e.g. the cut-off for grammaticality is >0)? Then, there would be room for the set of (remotely) grammatical things to be infinite. I suspect the objection to P=W would remain in this context, but it would have to be characterized somewhat differently, I think. I’m curious what that would be.
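The contrast between the two cut-offs can be sketched with a toy model, which is purely an assumption for illustration: the only words with non-zero probability are non-empty strings of "a", and the word of length n gets probability 0.5^n. Because the probabilities sum to 1, at most 1/θ strings can have probability at or above any positive threshold θ, whereas a cut-off of "greater than zero" admits every string of "a".

```python
def prob(word, stop=0.5):
    """Toy model (invented): non-empty strings of 'a' get geometric probability;
    every other string gets probability zero."""
    if not word or set(word) != {"a"}:
        return 0.0
    return (1 - stop) ** (len(word) - 1) * stop

def count_at_or_above(threshold, max_len=200):
    """Count the words a, aa, aaa, ... up to max_len with probability >= threshold."""
    return sum(1 for n in range(1, max_len + 1) if prob("a" * n) >= threshold)

def count_nonzero(max_len=200):
    """Count the words a, aa, aaa, ... up to max_len with probability > 0."""
    return sum(1 for n in range(1, max_len + 1) if prob("a" * n) > 0)

if __name__ == "__main__":
    for max_len in (50, 100, 200):
        print(max_len, count_at_or_above(0.01, max_len), count_nonzero(max_len))
```

In this sketch the count above a positive threshold stops growing once words get long enough, while the count of non-zero-probability words grows with every increase in the length bound.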

1. Joe Pater Post author

A brief addition to this thread about probabilities over (UR, SR) pairs (probabilistic OT) vs. probabilities over SRs (Hayes-Wilson MaxEnt phonotactics): More broadly speaking, this can be seen as the question of whether we have conditional probability distributions (where in phonology we are conditioning on URs), or are trying to directly define a distribution over some space of representations (words, sentences). It does seem like the problem Jeff brings up shouldn’t apply when we are defining conditional probability distributions, which one is often doing in syntax as well as in phonology.

12. Brendan O'Connor

I have a more basic question — there seems to be a presumption that human judgments about grammaticality, well-formedness, or acceptability, should be related to language model probabilities. Why?

(I had my students in intro NLP class generate from 2gram and 3gram markov models and do grammaticality judgments on the outputs. I still need to check the statistics on those.)
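For readers who want to try something similar, here is a minimal sketch of sampling from an n-gram Markov model. It is not the code used in that class; the function names and the three-sentence corpus in the usage lines are hypothetical.

```python
import random
from collections import defaultdict

def train_ngram(sentences, n=2):
    """Map each (n-1)-token history to the list of tokens observed after it.
    Keeping repeats in the list makes uniform sampling proportional to counts."""
    table = defaultdict(list)
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            history = tuple(tokens[i:i + n - 1])
            table[history].append(tokens[i + n - 1])
    return table

def generate(table, n=2, rng=random, max_len=30):
    """Sample tokens one at a time until the end marker or max_len is reached."""
    history = ("<s>",) * (n - 1)
    output = []
    while len(output) < max_len:
        next_token = rng.choice(table[history])
        if next_token == "</s>":
            break
        output.append(next_token)
        history = (history + (next_token,))[1:]
    return " ".join(output)

if __name__ == "__main__":
    # Hypothetical toy corpus; a real exercise would use a much larger one.
    corpus = ["the dog barks", "the cat sleeps", "a dog sleeps"]
    bigram_table = train_ngram(corpus, n=2)
    print(generate(bigram_table, n=2, rng=random.Random(0)))
```

Outputs sampled this way could then be handed to raters for grammaticality judgments and compared against the probabilities the model assigns.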

1. Joe Pater Post author

Thanks for the question Brendan. For the Hayes and Wilson (2008) MaxEnt model of phonotactics, the main answer would be that it appears to be the best account of phonotactic knowledge that we have, in terms of predicting data from human judgments after it’s been trained on a lexicon (usually English). For a comparison with other models, see Daland et al. (2011: Phonology) (available here: https://sites.google.com/site/rdaland/). I don’t have an a priori view myself that a probabilistic model of phonotactic knowledge is necessary — one could, as some posters have suggested, have a non-probabilistic gradient model, or a categorical competence model that interacts with a gradient performance model to get observed gradience in judgments — but I don’t know an existing competitor that has been shown to outperform the HW learner (though do see Gorman’s 2013 UPenn diss: http://www.csee.ogi.edu/~gormanky/papers/gorman-dissertation.pdf). I think that much of the success of the HW model — and a good part of the reason that MaxEnt models in general have been attractive to phonologists — is that it defines probabilities over standard phonological representations, using constraints (=NLP features) that are also formulated like those that have been successfully previously applied in non-probabilistic phonological analysis.

2. Robert Daland

To follow up on what Joe said, Daland et al. (2011) review existing work and conclude that the best theory so far is that human well-formedness judgments are better predicted by log-probability than by any other measure that has been devised.

The experiment in Daland et al. (2011) was consistent with that hypothesis, as were the few other experiments they reviewed that could have tested it.

As discussed in Hayes & Wilson (2008), gradient judgments are a ubiquitous feature of phonotactic well-formedness tasks, even when all properties of the stimulus are controlled except for one particular phonotactic (e.g. onset well-formedness, cf. [bla] vs. [lba]).

Since that is what the data show, we need to have an explanation for it. Probabilistic models are one natural route for this, if a suitable *linking theory* can be found which relates the model probability to the human judgment.

13. Robert Daland

Hi all,
In following up on this thread, I ran across an oldie but a goodie here (Frisch, Large, & Pisoni, 2000 — http://cas.usf.edu/~frisch/Frisch_Large_Pisoni_00.pdf). This is one of the most careful papers on the topic that I’ve seen. Among the numerous predictors they considered, the log probability of a word (according to the language model they used, essentially the syllabic parser of Coleman & Pierrehumbert, 1997) was generally the best predictor of acceptability ratings.

They investigated low- versus high- probability CVCV(CV(CV))C sequences (2-4 syllables, singleton onsets, word coda). The acceptability ratings declined with word length for both low- and high- probability stimuli. Note that low-probability stimuli here were defined as phonotactically licit but highly unexpected, e.g. ZUyaythes. The decline was approximately exponential (that is, linear in log-probabilities).

This data is consistent with the claim that Daland made in response to Jeff Heinz’s question, that acceptability ratings are best modeled by the log probabilities of a language model. (Note however that Daland’s GLOW paper and the corresponding Phonology paper are primarily concerned with making sure that the language model is well-defined, and do not make any strong claims about acceptability judgments.) Of course, as numerous commentators have pointed out here, it is also consistent with some kind of “competence/performance” distinction and attributing the *observable* decline in acceptability to some kind of *unobservable* performance module (though I am unclear as to what that is really supposed to mean).

In a new (but perhaps unsurprising?) wrinkle, Pierrehumbert and her student Jeremy Needle have shown that nonwords get a significant acceptability boost if they contain recognizable sub-parts (or more specifically, they showed this for affixes). They claim that the single most likely parse was the best predictor for the acceptability rating, although their work is not publicly available yet.

14. Jeff Heinz

I have been away for a while, and was delighted to find all these
comments here. So, yes, I agree with Steve (“*Obviously*,
well-formedness models and language models are not the same thing.”),
Thomas (“there is no direct map because probabilities decrease with
length whereas well-formedness does not.”), and Mark (“I’d be
surprised if there were a simple probability threshold that
characterises grammaticality.”) about these things. Gaja’s question
about developing a more sophisticated way to relate probability to
well-formedness is exactly the point. Clark and Lappin have more
recent work on this in syntax than what I mentioned earlier, such as a
2015 ACL paper (see here).

Gaja Jarosz wrote:

“… So I wonder whether Jeff would say the crucial comparison is
between a short perfectly well-formed string and a long perfectly
well-formed string (for which probability=wellformedness would have to
predict a difference but for which ‘local grammaticality’ may not) or
whether it’s between a mildly offensive short string and an equally
mildly offensive longer string… ”

Yes, that is absolutely correct. Robert’s proposed experiments
contain such stimuli, and if I have any recommendations it would be
just to focus on these cases. I see each of these comparisons as
individual experiments, both of which test the predictions of the
hypothesis.

Robert Daland wrote:

“[Frisch et al. 2000’s] data is consistent with the claim that Daland
made in response to Jeff Heinz’s question, that acceptability ratings
are best modeled by the log probabilities of a language model… Of
course, as numerous commentators have pointed out here, it is also
consistent with some kind of “competence/performance” distinction and
attributing the *observable* decline in acceptability to some kind of
*unobservable* performance module (though I am unclear as to what that
is really supposed to mean).”

Three comments. First, on the face of it the result is interesting
because in syntax it is much clearer that acceptability does not
decline with sentence length.

Second, I wish to point out that the instructions to speakers are not
clear enough to my liking for speakers to be rating
well-formedness. In the Frisch et al. 2000 study the Likert scale used
conflates likelihood and well-formedness: “Participants were
instructed to rate the nonwords for their wordlikeness. The
instructions included descriptors for a 7-point scale to be used for
rating the nonwords. A rating of 1 was described as “Low–Impossible—
this word could never be a word of English,” rating of 4 as
“Medium–Neutral—this word is equally good and bad as a word of
English,” and 7 as “High–Possible—this word could easily be a word
of English.” The other numbers were to be used for nonwords between
these categories. Ratings of 2 or 3 represented “unlikely” and 5 or 6
were “likely.” We instructed participants to work as quickly as
possible without sacrificing accuracy.”

So points 1, 4 and 7 could be interpreted as well-formedness but the
others are clearly about likelihood. Even 1 and 7 are about “could it
be a word of English?” Could the English nonce word [mer.si] be a word
of English? No, it’s French! What is a naive subject to do?

I go back to my earlier example:
[ka.pa.la.tSa.ko.la.pi.ni.pi.sau.ki.ma.la.ga.la.nu]. Sure, it’s
unlikely to ever be a word of English (simply because long
words are rare), so in this task I’d rate it pretty low. In fact I’d
rate the likelihood of some very long English sentences low
too. But w.r.t. well-formedness, I actually think it is quite alright. It
even rolls off the tongue pretty easily, much like the list
“Apalachicola, Winnipesaukee, Kalamazoo”.

Third, it should be no surprise that the performance module is
unobservable (at least directly). So is the grammar and more generally
all mental phenomena. All the evidence for them is indirect
evidence. As Thomas points out, the competence/performance
distinction is a very useful abstraction. It factors the problem into
simpler pieces. It can be much simpler not to treat length as a factor
in well-formedness (this paper helps explain why: “This simplest and
clearest finite model often consists of an infinite model with a
finiteness condition added to it.”). If your description of the above
results is representative of the facts, then the performance model is
probably exactly what you say it is: a log-linear decline in
acceptability w.r.t. length. It should then be possible to factor that
out to divine well-formedness from acceptability (though this will
still be an interesting non-trivial problem), which would be an answer
to Gaja’s question.
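One very simple way the “factoring out” step might be approached is sketched below. It is an assumption for illustration, not a method proposed in this thread: fit a straight line to log-scale acceptability scores as a function of length, then keep the residuals as a length-adjusted score. The function names are hypothetical, and the scores are assumed to be already on a log scale.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def factor_out_length(lengths, log_scores):
    """Return what is left of each score after removing a linear length effect."""
    intercept, slope = fit_line(lengths, log_scores)
    return [y - (intercept + slope * x) for x, y in zip(lengths, log_scores)]
```

Under this sketch, two items of different lengths that sit equally far above the fitted line receive the same adjusted score. Whether a residual of this kind tracks well-formedness is exactly the non-trivial empirical question.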