# What’s Harmony?

From an e-mail from Paul Smolensky, March 28, 2015. Even though he wasn’t doing phonology in the mid-1980s when he coined the term “Harmony Theory”, Paul had apparently taken a course on phonology with Jorge Hankamer and found vowel harmony fascinating.

“Harmony” in “Harmony Theory” arises from the fact that the Harmony function is a measure of *compatibility*; the particular word was inspired by vowel harmony, and by the letter ‘H’ which is used in physics for the Hamiltonian or energy function, which plays in statistical mechanics the same mathematical role that the Harmony function plays in Harmony theory: i.e., the function F such that prob(x) = k*exp(F(x)/T).
(Although I took the liberty of changing the sign of the function; in physics, it’s p(x) = k*exp(–H(x)/T), in Harmony Theory, it’s p(x) = k*exp(H(x)/T). That’s because it drove me crazy working in statistical mechanics that that minus sign kept coming and going and coming and going from equation to equation, leading to countless errors; I just dispensed with it at the outset and cut all that nonsense off at the pass.)

From an e-mail from Mark Johnson, Jan. 16th, 2016:

I always thought the reason why the physicists had a minus sign in the exponential was that otherwise temperatures would have to be negative. But I guess you can push the negation into the Hamiltonian, which is perhaps what Paul did.

From an e-mail from Paul Smolensky, Feb. 10th, 2016:

Yes, that’s just what I did. Instead of minimizing badness I switched to maximizing goodness. I’m just that kind of guy.

From an e-mail from Mark Johnson, Feb. 10th, 2016:

Probabilities are never greater than one, so log probabilities are always less than or equal to zero. So a negative log likelihood is always a non-negative quantity, and smaller negative log likelihood values are associated with more likely outcomes. So one way to understand the minus sign in the Gibbs-Boltzmann distribution is that it makes H(x) correspond to a negative log likelihood.

But I think one can give a more detailed explanation.

In a Gibbs-Boltzmann distribution p(x) = k*exp(–H(x)/T), H(x) is the energy of a configuration x.

Because energies H(x) are non-negative (which follows from the definition of energy?), and given a couple of other assumptions (e.g., that there are an infinite number of configurations and energies are unbounded — maybe other assumptions will do?), it follows that probability must decrease with energy, otherwise the inverse partition function k would not exist (i.e., the probability distribution p(x) would not sum to 1).

So if the minus sign were not there, the temperature T (which relates energy and probability) would need to be negative. There’s no mathematical reason why we couldn’t allow negative temperatures, but the minus sign makes the factor T in the formula correspond much more closely to our conventional understanding of temperature.

In fact, I think it is amazing that the constant T in the Gibbs-Boltzmann formula denotes exactly the pre-statistical mechanics concept of temperature (well, absolute temperature in Kelvin).  In many other domains there’s a complex relationship between a physical quantity and our perception of it; what is the chance of a simple linear relationship like this for temperature?

But perhaps it’s not a huge coincidence. Often our perceptual quantities are logarithmically related to physical quantities, so perhaps it’s no accident that T is inside the exp() rather than outside (where it would show up as an “exponential temperature” term). And the concept of temperature we had before Gibbs and Boltzmann wasn’t just a naive perception of warmth; there had been several centuries of careful empirical work on properties of gases, heat engines, etc., which presumably led scientists to the right notion of temperature well before the Gibbs-Boltzmann relationship was discovered.

From an e-mail from Paul Smolensky, March 27, 2016:

Here are some quick thoughts.

0. Energy E in physics is positive. That’s what forces the minus sign in p(x) \propto exp(-E(x)/T), as Mark observes.

Assuming x ranges over an infinite state space, the probability distribution can only be normalized to sum to one if the exponential approaches zero as x -> infinity, and if E(x) > 0 and T > 0, this can only happen if E(x) -> infinity as x -> infinity and we have the minus sign in the exponent.

1. Why is physical E > 0?

2. Perhaps the most fundamental property of E is that it is conserved: E(x(t)) = constant, as the state of an isolated physical system x(t) evolves in time t. From that point of view there’s no reason that E > 0; any constant value would do.

3. For a mechanical system, E = K + V, the sum of the kinetic energy K derived from the motion of the massive bodies in the system and the potential energy V. Given Newton’s second law, F = ma = m dv/dt, E is conserved when F = -grad V and K = mv^2/2, for then

dE/dt = d(mv(t)^2/2)/dt + dV(x(t))/dt = mv dv/dt + dx/dt · grad V = v(ma) + v(-F) = 0; that’s where the minus sign in -grad V comes from.

Everything in the equation E = K + V could be inverted, multiplied by -1, without change in the conservation law. But the commonsense meaning of “energy” is something that should increase with v, hence K = mv^2/2 rather than -mv^2/2.

4. Although K = mv^2/2 > 0, V is often negative.

E.g., for the gravitational force centered at x = 0, F(x) = -GmM x/|x|^3 = -grad V if V(x) = -GmM/|x| < 0
(any constant c can be added to this definition of V without consequence; but even so, for sufficiently small x, V(x) < 0).

Qualitatively: gravitational force is attractive, directed to the origin in this case, and this force is -grad V, so grad V must point away from the origin, so V must increase as x increases, i.e., must decrease as x decreases. V must fall as 1/|x| in order for F to fall as 1/|x|^2 so the decrease in V as x -> 0 must take V to minus infinity.

5. In the cognitive context, it’s not clear there’s anything corresponding to the kinetic energy of massive bodies. So it’s not clear there’s anything to fix a part of E to be positive; flipping E by multiplication by -1 doesn’t seem to violate any intuitions. Then, assuming we keep T > 0, we can (must) drop the minus sign in p(x) \propto exp(-E(x)/T) = exp(H(x)/T) where we define Harmony as H = -E. Now the probability of x increases with H(x); lower H is avoided, hence higher H is “better”, hence the commonsense meaning of “Harmony” has the right polarity.
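The sign flip in point 5 can be checked numerically. The sketch below is only an illustration, not something from the e-mails: the three energy values and the helper name `boltzmann` are made up. It shows that exp(-E/T) with the physics sign and exp(H/T) with H = -E normalize to the same distribution, and that probability rises with Harmony.

```python
import math

def boltzmann(scores, T=1.0, sign=+1):
    """Return normalized probabilities p(x) = k * exp(sign * score(x) / T)."""
    exponents = [sign * s / T for s in scores]
    # Subtracting the largest exponent avoids overflow; it cancels in the ratio.
    m = max(exponents)
    weights = [math.exp(e - m) for e in exponents]
    z = sum(weights)  # 1/z plays the role of the constant k
    return [w / z for w in weights]

energies = [0.5, 2.0, 3.5]           # hypothetical E(x) for three states
harmonies = [-e for e in energies]   # H = -E

p_energy = boltzmann(energies, T=1.0, sign=-1)    # physics convention
p_harmony = boltzmann(harmonies, T=1.0, sign=+1)  # Harmony convention

for h, p in zip(harmonies, p_harmony):
    print(f"H = {h:5.1f}  p = {p:.4f}")
```

The two conventions differ only in bookkeeping; what changes is whether one talks about minimizing badness or maximizing goodness.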

E-mail from Mark Johnson, March 27, 2016

Very nice!  I was thinking about kinetic energy, but yes, potential energy (such as gravitational energy) is typically conceived as negative (I remember my high school physics class, where we thought of gravitational fields as “wells”).  I never thought about how this is forced once kinetic energy is positive.

Continuing in this vein, there are a couple of other obvious questions once one thinks about the relationship between Harmony theory and exponential models in physics.

For example, does the temperature T have any cognitive interpretation?  That is, is there some macroscopic property of a cognitive system that T represents?

More generally, in statistical mechanics the number (or more precisely, the density) of possible states or configurations varies as a function of their energy, and there are so many more higher energy states than lower energy ones that the typical or expected value of a physical quantity like pressure is not that of the more probable low energy states, but instead determined by the more numerous, less probable higher energy states.

I’d be extremely interested to hear if Paul knows of any cases where this or something like it occurs in cognitive science.  I’ve been looking for convincing cases ever since I got interested in Bayesian learning!  The one case I know of has to do with “sparse Dirichlet priors”, and it’s not exactly overwhelming.
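To make the density-of-states point concrete, here is a toy sketch. The energy levels, the tenfold growth in the number of states per level, and all function and variable names are invented for illustration. Each individual low-energy state is more probable than any individual high-energy state, but the high-energy levels are so much more numerous that they carry most of the probability mass and dominate the expected energy.

```python
import math

def level_probabilities(energies, degeneracies, T):
    """Probability of each energy level: p(E) is proportional to g(E) * exp(-E / T)."""
    weights = [g * math.exp(-e / T) for e, g in zip(energies, degeneracies)]
    z = sum(weights)
    return [w / z for w in weights]

energies = [0, 1, 2, 3, 4]
degeneracies = [1, 10, 100, 1000, 10000]  # hypothetical number of states per level
T = 1.0

p_level = level_probabilities(energies, degeneracies, T)
# Probability of one particular state at each level (all states in a level are equiprobable).
p_single_state = [p / g for p, g in zip(p_level, degeneracies)]
mean_energy = sum(e * p for e, p in zip(energies, p_level))

for e, pl, ps in zip(energies, p_level, p_single_state):
    print(f"E = {e}: level mass = {pl:.3f}, single state = {ps:.2e}")
print(f"expected energy = {mean_energy:.2f}")
```

With these made-up numbers the single most probable state sits at E = 0, yet the expected energy is pulled up towards the top level.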

E-mail from Paul Smolensky, March 27, 2016

The absolute magnitude of T has no significance unless the absolute magnitude of H does, which I doubt. So I’d take Mark’s question about T to come down to something like: what’s the cognitive significance of T -> 0 or T -> infinity or T ~ O(1)?

And I’d look for answers in terms of the cognitive role of different types of inference. T -> 0 gives maximum-likelihood inference; T -> infinity gives uniform sampling; T ~ O(1) gives sampling from the distribution exp(H(x)). Mark, you’re in a better position to interpret the cognitive significance of such inference patterns.
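The three temperature regimes can be illustrated with a small sketch. The Harmony values and the function name `harmony_distribution` are assumptions made up for the example; the only thing taken from the text is the form p(x) \propto exp(H(x)/T).

```python
import math

def harmony_distribution(harmonies, T):
    """Normalized p(x) proportional to exp(H(x) / T)."""
    exponents = [h / T for h in harmonies]
    m = max(exponents)  # subtract the max so exp() cannot overflow at small T
    weights = [math.exp(e - m) for e in exponents]
    z = sum(weights)
    return [w / z for w in weights]

H = [1.0, 2.0, 4.0]  # hypothetical Harmony values for three candidates

cold = harmony_distribution(H, T=0.01)  # approaches picking the maximum-Harmony candidate
hot = harmony_distribution(H, T=1e6)    # approaches the uniform distribution
unit = harmony_distribution(H, T=1.0)   # proportional to exp(H(x))

print("T -> 0:       ", [round(p, 4) for p in cold])
print("T -> infinity:", [round(p, 4) for p in hot])
print("T = 1:        ", [round(p, 4) for p in unit])
```

Because only the ratio H/T enters the formula, rescaling H and T together leaves the distribution unchanged, which is the sense in which the absolute magnitude of T has no significance on its own.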

As for the question of density of states of different Harmony/energy, the (log) density of states is essentially the entropy, so any cognitive significance entropy may have — e.g., entropy reduction as predictor of incremental sentence processing difficulty à la Hale — qualifies as cognitive relevance of density of states. As for the average value of a quantity reflecting less-probable-but-more-numerous states more than more-probable states, I’m not sure what the cognitive significance of average values is in general.

# Wellformedness = probability?

There are some old arguments against probabilistic models as models of language, but these do not seem to have much force anymore, especially because we now have models that can compute probabilities over the same representations that we use in generative linguistics (Andries Coetzee and I have an overview of probabilistic models of phonology in our Handbook chapter; Mark Johnson has a nice explanation of the development of MaxEnt models and how they differ from PCFGs, as well as other useful material on probabilistic models as models of language learning; Steve Abney has a provocative and useful piece about how the goals of statistical computational linguistics can be seen as the goals of generative linguistics; see more broadly the recent debate between Chomsky and Peter Norvig on probabilistic approaches to AI; see also the Probabilistic Linguistics book and Charles Yang’s review).

That’s not to say that there can’t be issues in formalizing probabilistic models of language. In a paper to appear in Phonology (available here) Robert Daland discusses issues that can arise in defining a probability distribution over the infinite set of possible words, in particular with Hayes and Wilson’s (2008) MaxEnt phonotactic grammar model. In the general case, for this to succeed, the probability of strings of increasing length must decrease sharply enough such that the sum of their probabilities never exceeds 1, and simply continues to approach it. Daland defines the conditions under which this will obtain in the Hayes and Wilson model in terms of the requirements on the weight of a *Struc constraint that assigns a penalty that increases as string length increases.
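A heavily simplified sketch can show why the weight of a length penalty matters. This is not Daland's analysis: it assumes a toy grammar whose only constraint is a *Struc-like penalty of weight `w_struc` per segment, over a made-up inventory of 20 segments. Under those assumptions there are n^L strings of length L, each with unnormalized score exp(-w L), so the normalizing sum is a geometric series that converges only when w > ln n.

```python
import math

def partial_partition(n_segments, w_struc, max_len):
    """Sum of exp(-w_struc * length) over all strings of length 0..max_len."""
    # Each extra segment multiplies the number of strings by n_segments
    # and each string's score by exp(-w_struc).
    ratio = n_segments * math.exp(-w_struc)
    return sum(ratio ** length for length in range(max_len + 1))

n = 20                     # hypothetical segment inventory size
threshold = math.log(n)    # in this toy setting, convergence needs w_struc > ln(n)

converging = [partial_partition(n, threshold + 0.5, L) for L in (10, 50, 100)]
diverging = [partial_partition(n, threshold - 0.5, L) for L in (10, 50, 100)]

# Closed-form limit of the geometric series when the ratio is below 1.
limit = 1.0 / (1.0 - n * math.exp(-(threshold + 0.5)))

print("weight above threshold:", converging, "limit:", limit)
print("weight below threshold:", diverging)
```

When the sum converges, dividing each string's score by it yields probabilities that sum to 1; when it diverges, no such normalization exists.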

In the question period after Robert’s presentation of this material at the GLOW computational phonology workshop in Paris in April, Jeff Heinz raised an objection against the general notion of formalizing well-formedness in terms of probabilities, and he repeated this argument at the Manchester fringe workshop last week. Here’s my reconstruction of it (hopefully Jeff will correct me if I get it wrong – I also don’t have the references to the earlier work that made this argument). Take a (relatively) ill-formed short string. Give it some probability. Now take a (relatively) well-formed string. Give it some probability. Now concatenate the well-formed string enough times until the whole thing has probability lower than the ill-formed string, which it eventually will.

This is meant to be a paradox for the view that we can formalize well-formedness in terms of probabilities: the long well-formed string has probability lower than the short ill-formed string. It’s not clear to me, however, that there is a problem (and it wasn’t clear to Robert Daland either – the question period discussion lasted well into lunch, with Ewan Dunbar taking up Jeff’s position at our end of the table). Notice that Jeff’s argument is making an empirical claim that the concatenation of the well-formed strings does not result in a well-formedness decrease. When I talked to him last week, he claimed that this is clearer in syntax than phonology. Robert’s position (which I agree with) is that it likely does result in a decrease – though from his review of the literature on phonotactic well-formedness judgments we don’t seem to have empirical data on this point.
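The arithmetic behind the concatenation argument can be made concrete with a small sketch. It rests on a strong simplifying assumption that is mine, not Jeff's or Robert's: that the probability of k concatenated copies of a string is the k-th power of the probability of one copy, ignoring any effects at the junctures. The two probabilities and the function name are invented.

```python
def copies_until_less_probable(p_good, p_bad):
    """Smallest k with p_good ** k < p_bad, treating copies as independent."""
    if not (0.0 < p_good < 1.0) or p_bad <= 0.0:
        raise ValueError("need 0 < p_good < 1 and p_bad > 0")
    k = 1
    while p_good ** k >= p_bad:
        k += 1
    return k

p_good = 1e-3   # hypothetical probability of a short well-formed string
p_bad = 1e-8    # hypothetical probability of a short ill-formed string

k = copies_until_less_probable(p_good, p_bad)
print(f"{k} copies of the well-formed string are less probable than the ill-formed one")
```

Whether judged well-formedness really stays constant as k grows, as the objection requires, is exactly the empirical question at issue.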

Robert asked us to work with him in designing the experiment, and at the time I wasn’t sure that this was the best use of our lunch time, but I think he has a point. If this is in fact an empirical issue, and we can agree beforehand on how to test it, then this would save a lot of time compared with the usual process of the advocates of one position designing an experiment, which even if it turns out the way they hope, can then be criticized by the advocates of the other position as not having operationalized their claim properly, and so on…

It’s also of course possible that this is not an empirical issue: that there is a concept of perfect well-formedness that probabilistic models cannot capture. This reminds me of a comment on a talk I got once from a prominent syntactician when I discussed probabilistic models that can give probability vanishingly close to zero to ill-formed structures: “but there are sentences that I judge as completely out for English – they should have probability zero”. My response was to simply repeat the phrase vanishingly close to zero, and check to make sure he knew what I meant.

# Representations in OT

I’ve recently had some useful discussion with people about the nature of representations in OT, and how they did or did not (or should or should not) change from a theory with inviolable constraints (= principles and parameters theory). I’d like to summarize my thoughts, and would very much welcome further discussion.

In our discussion following Tobias Scheer’s mfm fringe presentation, I brought up the point that when one switches to violable constraints, it’s not obvious that the representations should stay the same. Tobias asked for an example, and I didn’t have a good one right away, but then realized that a particularly clear and worked out one is in the discussion of extrametricality vs. nonfinality in Prince and Smolensky 1993/2004. Gaja Jarosz also reminded me of underspecification: since markedness is expressed in output constraints in OT, it’s not obvious that one also requires a theory of input underspecification for that purpose.

I think my own experience of working on my *NC project in the mid nineties illustrates some more aspects of what happened to representations as OT was being extended to segmental phonology, and brings up some further issues. When I started that project, I was looking for an explanation for the facts in terms of feature markedness and positional licensing. I wanted to get the directionality of postnasal voicing, which unlike most other local assimilation processes is L-to-R, instead of R-to-L. I also wanted to get conspiracies amongst processes that resolve nasal-voiceless obstruent clusters. I tried hard and failed, and eventually “gave up” and used the formally stipulative but phonetically grounded *NC constraint. I later realized that this failure was part of a more general issue: it doesn’t seem to be the case that positional markedness can always be derived from the combination of general context-free markedness constraints and general positional licensing constraints. The best I could do in terms of those assumptions was to say that the coda nasal [voice] wanted to be licensed in onset position and hence spread, but that didn’t explain why it was just nasals that did this, nor did it deal sufficiently well with directionality. I have some more discussion of the general problems with deriving positional markedness from prosodic licensing, and further references, in a discussion of local conjunction on p. 10 of this paper (also in my review of the Harmonic Mind).

Taking the approach to segmental phonology in the *NC proposal, we can ask what that commits us to in terms of a theory of representations. It looks to me like what we would need is a feature set that is sufficiently expressive to formalize our constraints, but that’s it. Phonetic grounding is expressed as restrictions on the universal set of constraints (or as restrictions on possible rankings, in work like Steriade’s on perceptual grounding). And this set of features could be universal, with no language-specific choices (thanks to Pavel Iosad for a question on this) – contrast and its absence can be captured by ranking alone. Furthermore, I don’t think there is any sense in which this theory has the concept of a natural class (thanks to Kristine Yu for a question on that). So this perhaps at least partially explains why the nature of segmental representations has not been a big topic in at least some variants of OT.

Now I should say that I can see lots of reasons why you would want to say that features do differ from language to language, and why the particular feature set you choose could have consequences for predictions about learning and generalization. But in terms of the particular theory I’ve just described, I don’t see any arguments for language-specific differences in feature specification. I should probably also say that I see reasons why one might not want to model the role of phonetics in phonology as just stipulations about the constraint set with some armchair phonetic justification – there are clearly plenty of alternatives. But the proposal is a natural extension of the general approach to encoding substance in OT as stipulations about constraints: e.g. that there is Onset and NoCoda, but not NoOnset and Coda.

Finally, let me emphasize that what I’ve said about a lack of interest in the nature of segmental representations based on my *NC work should not be taken as representative of a lack of interest in segmental features in OT as a whole (or even in my mind!). For example, there are extremely interesting questions about the nature of the representations needed for ‘spreading’: Bakovic (2000: diss.) takes the position that there is no spreading, while others have defended autosegmental representations or adopted gestural representations, and yet others (e.g. Cole and Kisseberth) have proposed domain-based representations.

# Data in generative phonology

I’d like to raise as a discussion topic the question of what the data are that we are trying to explain in generative phonology. In my view, the lack of clarity about this issue is a bigger foundational issue in our field than the lack of clarity about the goals we are pursuing, one of the discussion points I raised in my mfm fringe workshop presentation. It’s also a foundational issue not only for what I called Classical Universal Phonology in that presentation, but for just about any approach to phonology one can imagine. I should be clear that I don’t think that there needs to be a uniform set of data or goals. Rather, I think we’d be making quicker progress towards our broader shared goals of understanding the formal structure of phonologies, and explaining learning and typology, if we made our commitments in these respects more explicit in our work.

To get the discussion going, let me repeat the worry I expressed in the mfm fringe discussion, and mention some other data-related points that came up. When Marc van Oostendorp pressed me on my assertion that data issues were foundational issues, I brought up the lack of a definition of productivity as an example. It’s unfortunately too common that when an analysis or theory fails to capture some data pattern, the claim is made that the pattern is unproductive (e.g. that there are exceptions, that there are no alternations or that they are limited in some way, etc.), without applying the same scrutiny and criteria to the data that the theory is capturing. Probably even more common is that exceptions or variation are abstracted from, again without any clear criteria on when that can be done. My own belief is that productivity is gradient (see Hayes’ textbook ch. 9), and that we need theories that capture that gradience. But whether we are working with theories that are categorical or gradient in this respect, we need to define productivity if we are going to use it as a criterion for what data we need to explain.

In his question period at the fringe, Michael Becker pressed his interlocutors to provide evidence that the generalizations they saw in existing alternations were in fact encoded as generalizations in speakers’ minds. Becker’s approach, like that of a lot of other current work, is to test productivity experimentally. I’m on board with that program, but I’m also on board with good old analysis of corpus data (where ‘corpus’ includes the data from grammars and dictionaries that phonologists typically study), and I’m starting to get worried about what to do when the two sets of data point in very different directions. For example, the ‘stress heavy if penult’ part of the Latin stress rule is a nearly exception-free pattern in unsuffixed nouns in English. But as Claire Moore-Cantwell (p.c.) reports, it seems that it’s not particularly productive in nonce word productions/judgments. Claire has some good ideas about how to relate the corpus data to the judgments via learning, but it’s clear that the grammar is going to look very different from those posited for English from Chomsky and Halle (1968) onwards.

Wendell Kimper mentioned in his talk the issue that the set of attested human languages appears to be a small sample from the space of possible human languages. There are various kinds of statistical measures and data controls that we can use to determine how robust the typological generalizations are that we observe. But Kimper also reports that vowel harmony looked at that way may provide relatively little information, since many of the patterns of each type of harmony come from the same language families. My gut feeling, like Wendell’s I think, is that in those circumstances we should still keep going with the usual practice of just making an attested/unattested cut, and hoping that we are modeling signal rather than noise. But it is a worry, and probably one of the reasons that it’s good that we’re not putting all of our eggs in the typology modeling basket. A possible strategy is to focus on typological claims with a relatively large scope, for example, the size of stress windows, or the absence of sour grapes-style harmony (and the presence of spreading up to a blocker).

# Discussion: ‘Whither OT’ handout

I’d very much welcome comments or questions on my handout from the Manchester fringe session, held May 27th, 2015  [also: prepared introduction not on handout]. I’m not planning on turning it into a paper, though further discussion might change my mind! I also hope that this page can serve as a place for discussion of other issues raised at the workshop – links to handouts and other relevant materials would be very much appreciated.

Violable constraints in Classical Universal Phonology and beyond

Abstract: It appears to me that ‘Classical Universal Phonology’ (CUP) as a whole, rather than OT in particular, is receiving less attention from phonologists in the 21st century. I highlight some of the contributions of violable constraints to CUP, and provide an overview of some developments in generative phonology outside of CUP per se, again emphasizing the role of violable constraints, especially in formalizing grammatical learning. I conclude with some speculations about why CUP seems to be less popular these days, and about what we might expect for the future.

# Implicit and explicit learning

In this post, I want to explain why I’ve recently become interested in distinctions between implicit and explicit learning (also procedural and declarative memory – see below on the connection), and provide a quick overview of the literature I’ve been able to assimilate, as well as mention some things that seem particularly interesting to investigate with respect to phonological learning. The literature is vast, and there is thus far very little that has been done in this area in phonology (with the notable exception of the English past tense), so there’s lots of room for reading, thinking, talking and research, and I’d really welcome others’ thoughts!

The distinction came up for me in joint research with Elliott Moreton and Katya Pertsova [1]. We (EM and JP) developed a MaxEnt model of phonotactic learning, which we found out was virtually identical to a model of visual category learning that had been developed, and then abandoned, in the late 80s. It was abandoned because it made the wrong predictions about the relative difficulty of types of visual category. To our initial surprise, when EM and KP tested the learning of the phonotactic analogues of those category types, the predictions of our model were supported. After some more thought and reading of the psychological literature, the result seemed less surprising, but no less interesting. The classic visual category learning experiments are prototypical exercises in explicit learning: you are told you are to learn a rule separating two types of object, the relevant features are extremely salient and verbalizable (e.g. color, shape, size), in training you are asked to classify objects and are given feedback, and you are sometimes even asked for your current hypothesis about the rule. When the methodology is changed so that learning becomes even somewhat less explicit (even by omitting the instruction to seek rules [2]) the order of difficulty changes in the direction of the predictions of our model. Phonotactic learning is typically (and naturally) more implicit: the relevant phonological features are difficult, if not impossible to verbalize, and training proceeds by providing only positive examples of one category (i.e. words that are “in” the language). Our conclusion / hypothesis is that implicit (typical phonotactic) learning is well characterized by “cue-based” models like ours, while explicit (classic visual category) learning is not – “rule-based” models do better in that domain. See our paper on what we mean by “cue-based” and “rule-based”.

Implicit / explicit learning distinctions have a long and controversial history in psychology, and also connect to a very heated debate across psychology and linguistics. The recent debate in category learning in cognitive psychology [3], like the older psycho-linguistic debate [4], revolves around the question of whether there are two separate learning systems. Two-systems models in both domains make links to a distinction between declarative and procedural memory ([5], [6]). As far as I can tell, no connections have yet been made between these two two-systems literatures (I haven’t even found any cross-references yet). Future attempts to make these connections will need to confront the fact that “rule-based learning” is implicit in the psycho-linguistic literature, but explicit in category learning. The two-systems view in language makes a distinction between knowledge of words (explicit, declarative) and knowledge of rules for how words combine (implicit, procedural) – a quite famous distinction, thanks to Pinker. The term “rule” in category learning is by definition a generalization that one can make explicitly, and is thus linked with declarative memory. So those are the differences. But there are some ways in which the meaning of rule does overlap across the domains – rules are supposed to express relatively simple generalizations (e.g. broad in application in linguistics, with sometimes a complexity metric applied to choose amongst rules, simple in terms of featural makeup in visual category learning).

In thinking and talking about the potential extensions of two systems ideas to phonology, I prefer the terms “explicit” vs. “implicit” over “declarative” vs. “procedural” as well as “words” vs. “rules”. On the first alternative, there is also no obvious way in which knowledge of phonotactics is procedural. On the second, the relatively implicit learning that we have been studying, the learning of phonotactics, is the learning of generalizations over the phonological shape of words – “rules” about “words”. Some evidence that there is a common cognitive underpinning between syntactic rules and phonotactic rules comes from the fact that both can elicit a Late Positive Component / P600 in ERP studies (see [7] for some recent work and references).

A clear direction for research is to try to get evidence of which memory systems (assuming that there are in fact multiple systems) are recruited in different types of phonological learning. A pioneering study in this respect [8] finds evidence of declarative memory being linked to what they call analogy (in fact an opaque interaction), and procedural memory linked to simple concatenation of an affix. There are a number of issues with this study, and lots of directions for further work. One of its further results that I find intriguing is that no correlation was found between working memory and success in phonological learning. There seems to be a quite solid association between working memory and success in the classic visual category learning paradigm mentioned above [9]. Does working memory capacity correlate with success in phonotactic learning? (Probably not.) With any other aspect of phonological learning? (Probably: see the literature on individual differences and the phonological loop, which is a type of working memory.) More broadly, are there robust individual differences in phonotactic learning and other kinds of phonological learning? One reason to be optimistic that artificial language learning might lead to insight into individual differences is that it seems to have been used as one of the measures in all of the language learning aptitude tests that are predictive of success in natural language learning [10]. Individual differences in implicit learning tend not to be very robust (see [11] for a review), so if we find a measure that has good test-retest reliability, this could be a contribution of broad interest (has anyone done test-retest with Saffran-style TP learning?). There are interesting potential connections between implicit learning in language and music that seem worth exploring [12]. And dreaming really big, we might imagine doing very large scale studies of individual differences over the web, and collecting genetic samples.

Finally, getting back to phonological theory (and to earth), if it is the case that different aspects of phonology are learned using different cognitive sub-systems (see [13] for some recent ERP evidence), this should have deep consequences for how we model knowledge of phonology and its learning.

[1] Moreton, Elliott, Joe Pater and Katya Pertsova. 2014. Phonological concept learning. Ms, University of North Carolina and University of Massachusetts Amherst. Resubmitted to Cognitive Science August 2014, revised version of 2013 paper, comments still very much welcome.

[2] Kurtz, K. J., K. R. Levering, R. D. Stanton, J. Romero and S. N. Morris (2013). Human learning of elemental category structures: revising the classic result of Shepard, Hovland, and Jenkins (1961). Journal of Experimental Psychology: Learning, Memory, and Cognition 39(2), 552–572.

[3] Newell, B.R., Dunn, J.C., & Kalish, M. (2011). Systems of category learning: Fact or fantasy? In B.H. Ross (Ed.), The Psychology of Learning & Motivation, Vol. 54, 167-215.

[4] Pinker, S. & Prince, A. (1988) On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73-193. Reprinted in S. Pinker & J. Mehler (Eds.) (1988) Connections and symbols. Cambridge, MA: MIT Press.

[5] Ullman, M. T. (2004). Contributions of neural memory circuits to language: The declarative/procedural model. Cognition, 92(1-2). 231-270.

[6] Ashby, F. G. and W. T. Maddox (2005). Human category learning. Annual Review of Psychology 56, 149–178.

[7] Sanders, L., J. Pater, C. Moore-Cantwell, R. Staubs and B. Zobel. 2014. Adults Quickly Acquire Phonological Rules from Artificial Languages. Ms., UMass Amherst, available on request.

[8] Wong PCM, Ettlinger M, Zheng J (2013) Linguistic Grammar Learning and DRD2-TAQ-IA Polymorphism. PLoS ONE 8(5): e64983. doi:10.1371/journal.pone.0064983

[9] Lewandowsky, S. (2011). Working memory capacity and categorization: Individual differences and models. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37, 720–738. doi:10.1037/a0022639

[10] Carroll, John B. 1981. Twenty-five years of research on foreign language aptitude. In K. C. Diller (Ed.), Individual differences and universals in language learning aptitude (pp. 83–118). Rowley, MA: Newbury House.

[11] Kaufman, S.B., DeYoung, C.G., Gray, J.R., Jimenez, L., Brown, J.B., & Mackintosh, N. (2010). Implicit learning as an ability. Cognition, 116, 321-340.

[12] Ettlinger M, Margulis EH and Wong PC (2011) Implicit memory in music and language. Front. Psychology 2:211. doi: 10.3389/fpsyg.2011.00211

[13] Moore-Cantwell, Claire and Lisa Sanders. 2014. Two types of implicit knowledge of probabilistic phonotactics. Poster presented at the 22nd Manchester Phonology Meeting, and LabPhon 14.