On the Synthetic Learner blog, Emmanuel Dupoux recently posted some comments on a recent Cognitive Science paper co-authored by Gaja Jarosz and Shira Calamaro. Gaja has also written a reply. While you are there, take a peek around the blog and the Bootphon website: Dupoux has a big and very interesting project on unsupervised learning of words and phonological categories from the speech stream.
There are some old arguments against probabilistic models as models of language, but these do not seem to have much force anymore, especially because we now have models that can compute probabilities over the same representations that we use in generative linguistics. Andries Coetzee and I have an overview of probabilistic models of phonology in our Handbook chapter; Mark Johnson has a nice explanation of the development of MaxEnt models and how they differ from PCFGs, as well as other useful material on probabilistic models as models of language learning; and Steve Abney has a provocative and useful piece about how the goals of statistical computational linguistics can be seen as the goals of generative linguistics. See more broadly the recent debate between Chomsky and Peter Norvig on probabilistic approaches to AI, and also the Probabilistic Linguistics book and Charles Yang’s review.
That’s not to say that there can’t be issues in formalizing probabilistic models of language. In a paper to appear in Phonology (available here) Robert Daland discusses issues that can arise in defining a probability distribution over the infinite set of possible words, in particular with Hayes and Wilson’s (2008) MaxEnt phonotactic grammar model. In the general case, for this to succeed, the probability of strings of increasing length must decrease sharply enough such that the sum of their probabilities never exceeds 1, and simply continues to approach it. Daland defines the conditions under which this will obtain in the Hayes and Wilson model in terms of the requirements on the weight of a *Struc constraint that assigns a penalty that increases as string length increases.
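To make the convergence requirement concrete, here is a minimal sketch of a deliberately simplified case. It is not Daland's actual analysis. It assumes (my assumptions, not the paper's) that every string over an alphabet of `alphabet_size` segments is a possible word and that the only active constraint is a *Struc constraint with weight `weight` that is violated once per segment. There are then `alphabet_size ** n` strings of length n, each with MaxEnt score `exp(-weight * n)`, so the total score is a geometric series that stays finite only when the weight exceeds the log of the alphabet size.

```python
import math

def struc_score_sum(alphabet_size, weight, max_len):
    """Total unnormalized MaxEnt score of all strings of length 1..max_len.

    Assumes one *Struc violation per segment and no other constraints,
    so each of the alphabet_size**n strings of length n scores exp(-weight*n).
    """
    ratio = alphabet_size * math.exp(-weight)
    return sum(ratio ** n for n in range(1, max_len + 1))

def sum_is_finite(alphabet_size, weight):
    """The geometric series converges only if alphabet_size * exp(-weight) < 1."""
    return weight > math.log(alphabet_size)

# Hypothetical numbers: 20 segments, so the threshold is log(20), roughly 3.0.
print(sum_is_finite(20, 3.5))   # True: the scores can be normalized
print(sum_is_finite(20, 2.5))   # False: the total grows without bound
```

With additional phonotactic constraints in the grammar, the condition on the *Struc weight would be different; the sketch only illustrates why some such condition is needed.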
In the question period after Robert’s presentation of this material at the GLOW computational phonology workshop in Paris in April, Jeff Heinz raised an objection against the general notion of formalizing well-formedness in terms of probabilities, and he repeated this argument at the Manchester fringe workshop last week. Here’s my reconstruction of it (hopefully Jeff will correct me if I get it wrong; I also don’t have the references to the earlier work that made this argument). Take a (relatively) ill-formed short string. Give it some probability. Now take a (relatively) well-formed string. Give it some probability. Now concatenate the well-formed string with itself enough times that the whole thing has a lower probability than the ill-formed string, which it eventually will.
This is meant to be a paradox for the view that we can formalize well-formedness in terms of probabilities: the long well-formed string has probability lower than the short ill-formed string. It’s not clear to me, however, that there is a problem (and it wasn’t clear to Robert Daland either – the question period discussion lasted well into lunch, with Ewan Dunbar taking up Jeff’s position at our end of the table). Notice that Jeff’s argument is making an empirical claim that the concatenation of the well-formed strings does not result in a well-formedness decrease. When I talked to him last week, he claimed that this is clearer in syntax than phonology. Robert’s position (which I agree with) is that it likely does – though from his review of the literature on phonotactic well-formedness judgments we don’t seem to have empirical data on this point.
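The arithmetic behind the concatenation argument can be sketched as follows. The sketch assumes, purely for illustration, that the probability of n concatenated copies of a string is the single-copy probability raised to the power n; an actual grammar model would not be exactly multiplicative, and the probabilities used below are invented. The function name is hypothetical.

```python
import math

def copies_needed(p_good, p_bad):
    """Smallest n such that n copies of the well-formed string are less
    probable than the ill-formed string, assuming P(n copies) = p_good**n.
    The comparison is done in log space to avoid underflow."""
    if not (0 < p_good < 1 and 0 < p_bad < 1):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    n = 1
    while n * math.log(p_good) >= math.log(p_bad):
        n += 1
    return n

# Invented values: a well-formed string at 0.01, an ill-formed one at 1e-9.
print(copies_needed(0.01, 1e-9))   # 5
```

However small the probability of the ill-formed string, as long as it is nonzero some finite number of copies will undercut it. Whether that matches judgments is the empirical question discussed above.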
Robert asked us to work with him in designing the experiment, and at the time I wasn’t sure that this was the best use of our lunch time, but I think he has a point. If this is in fact an empirical issue, and we can agree beforehand on how to test it, then this would save a lot of time compared with the usual process of the advocates of one position designing an experiment, which even if it turns out the way they hope, can then be criticized by the advocates of the other position as not having operationalized their claim properly, and so on…
It’s also of course possible that this is not an empirical issue: that there is a concept of perfect well-formedness that probabilistic models cannot capture. This reminds me of a comment I once got from a prominent syntactician on a talk in which I discussed probabilistic models that can give probability vanishingly close to zero to ill-formed structures: “but there are sentences that I judge as completely out for English – they should have probability zero”. My response was to simply repeat the phrase vanishingly close to zero, and check to make sure he knew what I meant.
I’ve recently had some useful discussion with people about the nature of representations in OT, and how they did or did not (or should or should not) change from a theory with inviolable constraints (= principles and parameters theory). I’d like to summarize my thoughts, and would very much welcome further discussion.
In our discussion following Tobias Scheer’s mfm fringe presentation, I brought up the point that when one switches to violable constraints, it’s not obvious that the representations should stay the same. Tobias asked for an example, and I didn’t have a good one right away, but then realized that a particularly clear and worked out one is in the discussion of extrametricality vs. nonfinality in Prince and Smolensky 1993/2004. Gaja Jarosz also reminded me of underspecification: since markedness is expressed in output constraints in OT, it’s not obvious that one also requires a theory of input underspecification for that purpose.
I think my own experience of working on my *NC project in the mid-nineties illustrates some more aspects of what happened to representations as OT was being extended to segmental phonology, and brings up some further issues. When I started that project, I was looking for an explanation for the facts in terms of feature markedness and positional licensing. I wanted to get the directionality of postnasal voicing, which, unlike most other local assimilation processes, is L-to-R instead of R-to-L. I also wanted to get conspiracies amongst processes that resolve nasal-voiceless obstruent clusters. I tried hard and failed, and eventually “gave up” and used the formally stipulative but phonetically grounded *NC constraint. I later realized that this failure was part of a more general issue: it doesn’t seem to be the case that positional markedness can always be derived from the combination of general context-free markedness constraints and general positional licensing constraints. The best I could do in terms of those assumptions was to say that the coda nasal [voice] wanted to be licensed in onset position and hence spread, but that didn’t explain why it was just nasals that did this, nor did it deal sufficiently well with directionality. I have some more discussion of the general problems with deriving positional markedness from prosodic licensing, and further references, in a discussion of local conjunction on p. 10 of this paper (also in my review of the Harmonic Mind).
Taking the approach to segmental phonology in the *NC proposal, we can ask what that commits us to in terms of a theory of representations. It looks to me like what we would need is a feature set that is sufficiently expressive to formalize our constraints, but that’s it. Phonetic grounding is expressed as restrictions on the universal set of constraints (or as restrictions on possible rankings, in work like Steriade’s on perceptual grounding). And this set of features could be universal, with no language-specific choices (thanks to Pavel Iosad for a question on this) – contrast and its absence can be captured by ranking alone. Furthermore, I don’t think there is any sense in which this theory has the concept of a natural class (thanks to Kristine Yu for a question on that). So this perhaps at least partially explains why the nature of segmental representations has not been a big topic in at least some variants of OT.
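As a toy illustration of the claim that contrast and its absence can follow from ranking alone, here is a minimal sketch of strict-domination evaluation. The input /nt/, the two candidates, and the violation counts are invented for the example, and the faithfulness constraint label `Ident[voice]` is my assumption rather than something taken from the *NC work; only `*NC` comes from the discussion above.

```python
def optimal(candidates, ranking):
    """Return the candidate that wins under strict domination.

    candidates maps each output form to a dict of constraint violations;
    ranking lists constraint names from highest- to lowest-ranked.
    Violation profiles are compared lexicographically in ranking order.
    """
    def profile(form):
        return tuple(candidates[form].get(c, 0) for c in ranking)
    return min(candidates, key=profile)

# Hypothetical candidate set for the input /nt/.
candidates_for_nt = {
    "nt": {"*NC": 1},            # faithful, keeps the nasal + voiceless cluster
    "nd": {"Ident[voice]": 1},   # postnasal voicing, unfaithful to voicing
}

print(optimal(candidates_for_nt, ["*NC", "Ident[voice]"]))   # nd
print(optimal(candidates_for_nt, ["Ident[voice]", "*NC"]))   # nt
```

The same feature set and the same constraints are used in both cases. Only the ranking decides whether the voicing contrast survives after a nasal.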
Now I should say that I can see lots of reasons why you would want to say that features do differ from language to language, and why the particular feature set you choose could have consequences for predictions about learning and generalization. But in terms of the particular theory I’ve just described, I don’t see any arguments for language-specific differences in feature specification. I should probably also say that I see reasons why one might not want to model the role of phonetics in phonology as just stipulations about the constraint set with some armchair phonetic justification – there are clearly plenty of alternatives. But the proposal is a natural extension of the general approach in OT of encoding substance as stipulations about constraints: e.g. that there are Onset and NoCoda, but not NoOnset and Coda.
Finally, let me emphasize that what I’ve said about a lack of interest in the nature of segmental representations based on my *NC work should not be taken as representative of a lack of interest in segmental features in OT as a whole (or even in my mind!). For example, there are extremely interesting questions about the nature of the representations needed for ‘spreading’: Bakovic (2000: diss.) takes the position that there is no spreading, while others have defended autosegmental representations or adopted gestural representations, and yet others (e.g. Cole and Kisseberth) have proposed domain-based representations.