
Did Frank Rosenblatt invent deep learning in 1962?

Deep learning (LeCun et al. 2015: Nature) involves training neural networks with hidden layers, sometimes many levels deep. Frank Rosenblatt (1928-1971) is widely acknowledged as a pioneer in the training of neural networks, especially for his development of the perceptron update rule, a procedure for training single-layer feedforward networks that provably converges when the training data are linearly separable. He is less widely acknowledged for his pioneering work with other network architectures, including multi-layer perceptrons, and models with connections “backwards” through the layers, as in recurrent neural nets. A colleague of Rosenblatt’s who prefers to remain anonymous points out that his “C-system” may even be a precursor to deep learning with convolutional networks (see esp. Rosenblatt 1967). Research on a range of perceptron architectures was presented in his 1962 book Principles of Neurodynamics, which was widely read by his contemporaries, and also by the next generation of neural network pioneers, who published the groundbreaking research of the 1980s. A useful concise overview of the work that Rosenblatt and his research group did can be found in Nagy (1991) (see also Tappert 2017). Useful accounts of the broader historical context can be found in Nilsson (2010) and Olazaran (1993, 1996).
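For readers who have not seen it, the perceptron update rule for a single-layer net can be sketched as follows. This is the standard textbook formulation rather than code from Rosenblatt's own work, and the function names and the toy AND data set are illustrative assumptions.

```python
def train_perceptron(samples, epochs=100):
    """Train a single-layer perceptron.

    samples: list of (features, label) pairs, with label in {-1, +1}.
    Returns the learned weights and bias.
    """
    n_features = len(samples[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                # Move the weights toward (y = +1) or away from (y = -1) the input.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: training has converged
            break
    return w, b


def predict(w, b, x):
    """Classify x as +1 or -1 using the learned weights and bias."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Toy, linearly separable data: logical AND.
AND_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
```

Calling `train_perceptron(AND_DATA)` returns weights that classify all four points correctly. The rule updates only one layer of weights, which is why it does not extend directly to nets with hidden layers.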

In interviews, Yann LeCun has noted the influence of Rosenblatt’s work, so I was surprised to find no citation of Rosenblatt (1962) in the Nature deep learning paper – it cites only Rosenblatt 1957, which has only single-layer nets. I was even more surprised to find perceptrons classified as single-layer architectures in Goodfellow et al.’s (2016) deep learning text (pp. 14-15, 27). Rosenblatt clearly regarded the single-layer model as just one kind of perceptron. The lack of citation for his work with multi-layer perceptrons seems to be quite widespread. Marcus’ (2012) New Yorker piece on deep learning classifies perceptrons as single-layer only, as does Wang and Raj’s (2017) history of deep learning. My reading of the current machine learning literature, and discussion with researchers in that area, suggests that the term “perceptron” is often taken to mean a single-layer feedforward net.

I can think of three reasons that Rosenblatt’s work is sometimes not cited, and even miscited. The first is that Minsky and Papert’s (1969/1988) book is an analysis of single-layer perceptrons, and adopts the convention of referring to them simply as perceptrons. The second is that the perceptron update rule is widely used under that name, and it applies only to single-layer networks. The last is that Rosenblatt and his contemporaries were not very successful in their attempts at training multi-layer perceptrons. See Olazaran (1993, 1996) for in-depth discussion of the complicated and usually oversimplified history around the loss of interest in perceptrons in the later 1960s, and the subsequent development of backpropagation for the training of multilayer nets and resurgence of interest in the 1980s.

As for my question about whether Rosenblatt invented deep learning, that would depend on how one defines deep learning, and what one means by invention in this context. Tappert (2017), a student of Rosenblatt’s, makes a compelling case for naming him the father of deep learning based on an examination of the types of perceptron he was exploring, and comparison with modern practice. In the end, I’m less concerned with what we should call Rosenblatt with respect to deep learning, and more concerned with his work on multi-layer perceptrons and other architectures being cited appropriately and accurately. As an outsider to this field, I may well be making mistakes myself, and I would welcome any corrections.

Update August 25 2017: See Schmidhuber (2015) for an exhaustive technical history of Deep Learning. This is very useful, but it doesn’t look to me like he is appropriately citing Rosenblatt: see secs. 5.1 through 5.3 (as well as the refs. above, see Rosenblatt 1964 on the cat vision experiments).

Non-web available reference (ask me for a copy)

Olazaran, Mikel. 1993. A Sociological History of the Neural Network Controversy. Advances in Computers Vol. 37. Academic Press, Boston.


Tappert, Charles. 2017. Who is the father of deep learning? Slides from a presentation May 5th 2017 at PACE University, downloaded June 15th from the conference site.

Rosenblatt, with the image sensor of the Mark I Perceptron (Source: Arvin Calspan Advanced Technology Center; Hecht-Nielsen, R. Neurocomputing (Reading, Mass.: Addison-Wesley, 1990).)

The Mark I Perceptron (Source: Arvin Calspan Advanced Technology Center; Hecht-Nielsen, R. Neurocomputing (Reading, Mass.: Addison-Wesley, 1990).)

Moreton, Pater and Pertsova in Cognitive Science

The nearly final version of our Phonological Concept Learning paper, to appear in Cognitive Science, is now available here. The abstract is below, and we very much welcome further discussion, either by e-mail to the authors (addresses on the first page of the paper), or as comments to this post.


Linguistic and non-linguistic pattern learning have been studied separately, but we argue for a comparative approach. Analogous inductive problems arise in phonological and visual pattern learning. Evidence from three experiments shows that human learners can solve them in analogous ways, and that human performance in both cases can be captured by the same models.

We test GMECCS, an implementation of the Configural Cue Model (Gluck & Bower, 1988a) in a Maximum Entropy phonotactic-learning framework (Goldwater & Johnson, 2003; Hayes & Wilson, 2008) with a single free parameter, against the alternative hypothesis that learners seek featurally-simple algebraic rules (“rule-seeking”). We study the full typology of patterns introduced by Shepard, Hovland, and Jenkins (1961) (“SHJ”), instantiated as both phonotactic patterns and visual analogues, using unsupervised training.

Unlike SHJ, Experiments 1 and 2 found that both phonotactic and visual patterns that depended on fewer features could be more difficult than those that depended on more features, as predicted by GMECCS but not by rule-seeking. GMECCS also correctly predicted performance differences between stimulus subclasses within each pattern. A third experiment tried supervised training (which can facilitate rule-seeking in visual learning) to elicit simple-rule-seeking phonotactic learning, but cue-based behavior persisted.

We conclude that similar cue-based cognitive processes are available for phonological and visual concept learning, and hence that studying either kind of learning can lead to significant insights about the other.

Calamaro and Jarosz on Synthetic Learner blog

On the Synthetic Learner blog, Emmanuel Dupoux recently posted some comments on a paper co-authored by Gaja Jarosz and Shira Calamaro that recently appeared in Cognitive Science. Gaja has also written a reply. While you are there, take a peek around the blog, and the Bootphon website: Dupoux has a big and very interesting project on unsupervised learning of words and phonological categories from the speech stream.

Wellformedness = probability?

There are some old arguments against probabilistic models as models of language, but these do not seem to have much force anymore, especially because we now have models that can compute probabilities over the same representations that we use in generative linguistics (Andries Coetzee and I have an overview of probabilistic models of phonology in our Handbook chapter; Mark Johnson has a nice explanation of the development of MaxEnt models and how they differ from PCFGs, as well as other useful material on probabilistic models as models of language learning; Steve Abney has a provocative and useful piece about how the goals of statistical computational linguistics can be seen as the goals of generative linguistics; see more broadly the recent debate between Chomsky and Peter Norvig on probabilistic approaches to AI; see also the Probabilistic Linguistics book and Charles Yang’s review).

That’s not to say that there can’t be issues in formalizing probabilistic models of language. In a paper to appear in Phonology (available here), Robert Daland discusses issues that can arise in defining a probability distribution over the infinite set of possible words, in particular with Hayes and Wilson’s (2008) MaxEnt phonotactic grammar model. In the general case, for this to succeed, the probability of strings of increasing length must decrease sharply enough that the sum of their probabilities never exceeds 1, and simply continues to approach it. Daland defines the conditions under which this will obtain in the Hayes and Wilson model in terms of the requirements on the weight of a *Struc constraint that assigns a penalty that increases as string length increases.
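To make the convergence condition concrete, here is a minimal sketch under a strong simplifying assumption that is mine, not Daland's: the only active constraint is *Struc, with weight `w` per segment, and every string over an alphabet of `k` segments is a possible word. Each of the `k**n` strings of length `n` then has MaxEnt score `exp(-w * n)`, so the total score is a geometric series with ratio `k * exp(-w)`, which is finite only when `w > log(k)`. All function names are illustrative.

```python
import math


def length_mass(k: int, w: float, n: int) -> float:
    """Summed unnormalized MaxEnt score of all k**n strings of length n."""
    return (k * math.exp(-w)) ** n


def partial_mass(k: int, w: float, max_len: int) -> float:
    """Summed score of all strings of length 0..max_len."""
    return sum(length_mass(k, w, n) for n in range(max_len + 1))


def converges(k: int, w: float) -> bool:
    """True if the total score over all string lengths is finite."""
    return k * math.exp(-w) < 1.0


def total_mass(k: int, w: float) -> float:
    """Closed-form normalizing constant (geometric series), if it exists."""
    ratio = k * math.exp(-w)
    if ratio >= 1.0:
        raise ValueError("weight too low: the sum over all strings diverges")
    return 1.0 / (1.0 - ratio)
```

With 20 segments, the threshold is `math.log(20)`, roughly 3.0: `converges(20, 3.5)` is true, while `converges(20, 2.5)` is false and `partial_mass` keeps growing as `max_len` increases. A grammar with additional constraints would change the details, but not the need for per-length mass to shrink fast enough.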

In the question period after Robert’s presentation of this material at the GLOW computational phonology workshop in Paris in April, Jeff Heinz raised an objection against the general notion of formalizing well-formedness in terms of probabilities, and he repeated this argument at the Manchester fringe workshop last week. Here’s my reconstruction of it (hopefully Jeff will correct me if I get it wrong – I also don’t have the references to the earlier work that made this argument). Take a (relatively) ill-formed short string. Give it some probability. Now take a (relatively) well-formed string. Give it some probability. Now concatenate the well-formed string enough times until the whole thing has probability lower than the ill-formed string, which it eventually will.
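The arithmetic behind this argument can be sketched as follows, assuming (my simplification, not necessarily Jeff's) that the probability of `n` concatenated copies of a string is the product of the individual probabilities, as it would be in a model that treats the copies independently. The function name and the example probabilities are hypothetical.

```python
import math


def repetitions_needed(p_good: float, p_bad: float) -> int:
    """Smallest n such that n copies of the well-formed string, with
    probability p_good ** n, are less probable than the ill-formed string."""
    if not (0.0 < p_good < 1.0 and 0.0 < p_bad < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    log_good = math.log(p_good)
    log_bad = math.log(p_bad)
    n = 1
    # Work in log space so that long concatenations do not underflow.
    while n * log_good >= log_bad:
        n += 1
    return n
```

For instance, with a well-formed string of probability 0.1 and an ill-formed string of probability 0.005, `repetitions_needed(0.1, 0.005)` returns 3: three copies have probability 0.001, which is below 0.005. Because `p_good` is less than 1, the loop always terminates, which is the "eventually will" of the argument.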

This is meant to be a paradox for the view that we can formalize well-formedness in terms of probabilities: the long well-formed string has probability lower than the short ill-formed string. It’s not clear to me, however, that there is a problem (and it wasn’t clear to Robert Daland either – the question period discussion lasted well into lunch, with Ewan Dunbar taking up Jeff’s position at our end of the table). Notice that Jeff’s argument is making an empirical claim that the concatenation of the well-formed strings does not result in a well-formedness decrease. When I talked to him last week, he claimed that this is clearer in syntax than phonology. Robert’s position (which I agree with) is that it likely does – though from his review of the literature on phonotactic well-formedness judgments we don’t seem to have empirical data on this point.

Robert asked us to work with him in designing the experiment, and at the time I wasn’t sure that this was the best use of our lunch time, but I think he has a point. If this is in fact an empirical issue, and we can agree beforehand on how to test it, then this would save a lot of time compared with the usual process of the advocates of one position designing an experiment, which even if it turns out the way they hope, can then be criticized by the advocates of the other position as not having operationalized their claim properly, and so on…

It’s also of course possible that this is not an empirical issue: that there is a concept of perfect well-formedness that probabilistic models cannot capture. This reminds me of a comment on a talk I got once from a prominent syntactician when I discussed probabilistic models that can give probability vanishingly close to zero to ill-formed structures: “but there are sentences that I judge as completely out for English – they should have probability zero”. My response was to simply repeat the phrase vanishingly close to zero, and check to make sure he knew what I meant.

Implicit and explicit learning

In this post, I want to explain why I’ve recently become interested in distinctions between implicit and explicit learning (also procedural and declarative memory – see below on the connection), and provide a quick overview of the literature I’ve been able to assimilate, as well as mention some things that seem particularly interesting to investigate with respect to phonological learning. The literature is vast, and there is thus far very little that has been done in this area in phonology (with the notable exception of the English past tense), so there’s lots of room for reading, thinking, talking and research, and I’d really welcome others’ thoughts!

The distinction came up for me in joint research with Elliott Moreton and Katya Pertsova [1]. We (EM and JP) developed a MaxEnt model of phonotactic learning, which we found out was virtually identical to a model of visual category learning that had been developed, and then abandoned, in the late 80s. It was abandoned because it made the wrong predictions about the relative difficulty of types of visual category. To our initial surprise, when EM and KP tested the learning of the phonotactic analogues of those category types, the predictions of our model were supported. After some more thought and reading of the psychological literature, the result seemed less surprising, but no less interesting. The classic visual category learning experiments are prototypical exercises in explicit learning: you are told you are to learn a rule separating two types of object, the relevant features are extremely salient and verbalizable (e.g. color, shape, size), in training you are asked to classify objects and are given feedback, and you are sometimes even asked for your current hypothesis about the rule. When the methodology is changed so that learning becomes even somewhat less explicit (even by omitting the instruction to seek rules [2]) the order of difficulty changes in the direction of the predictions of our model. Phonotactic learning is typically (and naturally) more implicit: the relevant phonological features are difficult, if not impossible to verbalize, and training proceeds by providing only positive examples of one category (i.e. words that are “in” the language). Our conclusion / hypothesis is that implicit (typical phonotactic) learning is well characterized by “cue-based” models like ours, while explicit (classic visual category) learning is not – “rule-based” models do better in that domain. See our paper on what we mean by “cue-based” and “rule-based”.

Implicit / explicit learning distinctions have a long and controversial history in psychology, and also connect to a very heated debate across psychology and linguistics. The recent debate in category learning in cognitive psychology [3], like the older psycho-linguistic debate [4], revolves around the question of whether there are two separate learning systems. Two systems models in both domains make links to a distinction between declarative and procedural memory ([5], [6]). As far as I can tell, no connections have yet been made between these two two-systems literatures (I haven’t even found any cross-references yet). Future attempts to make these connections will need to confront the fact that “rule-based learning” is implicit in the psycho-linguistic literature, but explicit in category learning. The two systems view in language makes a distinction between knowledge of words (explicit, declarative) and knowledge of rules for how words combine (implicit, procedural) – a quite famous distinction, thanks to Pinker. The term “rule” in category learning is by definition a generalization that one can make explicitly, and is thus linked with declarative memory. So those are the differences. But there are some ways in which the meaning of rule does overlap across the domains – rules are supposed to express relatively simple generalizations (e.g. broad in application in linguistics, with sometimes a complexity metric applied to choose amongst rules, simple in terms of featural makeup in visual category learning).

In thinking and talking about the potential extensions of two systems ideas to phonology, I prefer the terms “explicit” vs. “implicit” over “declarative” vs. “procedural” as well as “words” vs. “rules”. On the first alternative, there is also no obvious way in which knowledge of phonotactics is procedural. On the second, the relatively implicit learning that we have been studying, the learning of phonotactics, is the learning of generalizations over the phonological shape of words – “rules” about “words”. Some evidence that there is a common cognitive underpinning between syntactic rules and phonotactic rules comes from the fact that both can elicit a Late Positive Component / P600 in ERP studies (see [7] for some recent work and references).

A clear direction for research is to try to get evidence of which memory systems (assuming that there are in fact multiple systems) are recruited in different types of phonological learning. A pioneering study in this respect [8] finds evidence of declarative memory being linked to what they call analogy (in fact an opaque interaction), and procedural memory linked to simple concatenation of an affix. There are a number of issues with this study, and lots of directions for further work. One of its further results that I find intriguing is that no correlation was found between working memory and success in phonological learning. There seems to be a quite solid association between working memory and success in the classic visual category learning paradigm mentioned above [9]. Does working memory capacity correlate with success in phonotactic learning? (Probably not.) With any other aspect of phonological learning? (Probably: see the literature on individual differences and the phonological loop, which is a type of working memory.) More broadly, are there robust individual differences in phonotactic learning and other kinds of phonological learning? One reason to be optimistic that artificial language learning might lead to insight into individual differences is that it seems to have been used as one of the measures in all of the language learning aptitude tests that are predictive of success in natural language learning [10]. Individual differences in implicit learning tend not to be very robust (see [11] for a review), so if we find a measure that has good test-retest reliability, this could be a contribution of broad interest (has anyone done test-retest with Saffran-style TP learning?). There are interesting potential connections between implicit learning in language and music that seem worth exploring [12]. And dreaming really big, we might imagine doing very large scale studies of individual differences over the web, and collecting genetic samples.

Finally, getting back to phonological theory (and to earth), if it is the case that different aspects of phonology are learned using different cognitive sub-systems (see [13] for some recent ERP evidence), this should have deep consequences for how we model knowledge of phonology and its learning.

[1] Moreton, Elliott, Joe Pater and Katya Pertsova. 2014. Phonological concept learning. Ms, University of North Carolina and University of Massachusetts Amherst. Resubmitted to Cognitive Science August 2014, revised version of 2013 paper, comments still very much welcome.

[2] Kurtz, K. J., K. R. Levering, R. D. Stanton, J. Romero and S. N. Morris (2013). Human learning of elemental category structures: revising the classic result of Shepard, Hovland, and Jenkins (1961). Journal of Experimental Psychology: Learning, Memory, and Cognition 39(2), 552–572.

[3] Newell, B.R., Dunn, J.C., & Kalish, M. (2011). Systems of category learning: Fact or fantasy? In B.H. Ross (Ed) The Psychology of Learning & Motivation Vol. 54, 167-215.

[4] Pinker, S. & Prince, A. (1988) On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73-193. Reprinted in S. Pinker & J. Mehler (Eds.) (1988) Connections and symbols. Cambridge, MA: MIT Press.

[5] Ullman, M. T. (2004). Contributions of neural memory circuits to language: The declarative/procedural model. Cognition, 92(1-2). 231-270.

[6] Ashby, F. G. and W. T. Maddox (2005). Human category learning. Annual Review of Psychology 56, 149–178.

[7] Sanders, L., J. Pater, C. Moore-Cantwell, R. Staubs and B. Zobel. 2014. Adults Quickly Acquire Phonological Rules from Artificial Languages. Ms., UMass Amherst, available on request.

[8] Wong PCM, Ettlinger M, Zheng J (2013) Linguistic Grammar Learning and DRD2-TAQ-IA Polymorphism. PLoS ONE 8(5): e64983. doi:10.1371/journal.pone.0064983

[9] Lewandowsky, S. (2011). Working memory capacity and categorization: Individual differences and models. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37, 720–738. doi:10.1037/a0022639

[10] Carroll, John B. 1981. Twenty-five years of research on foreign language aptitude. In K. C. Diller (Ed.), Individual differences and universals in language learning aptitude (pp. 83–118). Rowley, MA: Newbury House.

[11] Kaufman, S.B., DeYoung, C.G., Gray, J.R., Jimenez, L., Brown, J.B., & Mackintosh, N. (2010). Implicit learning as an ability. Cognition, 116, 321-340.

[12] Ettlinger M, Margulis EH and Wong PC (2011) Implicit memory in music and language. Front. Psychology 2:211. doi: 10.3389/fpsyg.2011.00211

[13] Moore-Cantwell, Claire and Lisa Sanders. 2014. Two types of implicit knowledge of probabilistic phonotactics. Poster presented at the 22nd Manchester Phonology Meeting, and LabPhon 14.




Categorical correctness in MaxEnt hidden structure learning

In our 2012 SIGMORPHON paper, we propose the following measure of categorical success in MaxEnt learning with hidden structure, in this case Underlying Representations (URs) given only observed Surface Representations (SRs) (pp. 67-68):

 Our objective function is stated in terms of maximizing the summed probability of all (UR, SR) pairs that have the correct SR, and an appropriate criterion is therefore to require that the summed probability over full structures be greater for the correct SR than for any other SR. We thus term this simulation successful. We further note that given a MaxEnt grammar that meets this criterion, one can make the probabilities of the correct forms arbitrarily close to 1 by scaling the weights (multiplying them by some constant).

Unfortunately, the claim in the last sentence is false, and our success criterion does not seem stringent enough, since a grammar that meets it is not necessarily correct in the sense we would like.

Here’s a simple counter-example to that claim, involving metrical structure rather than URs. We have a trisyllable that has two parsings that generate medial stress, and a single parsing each for initial and final stress. Stress is marked with a capital A, and footing is shown in parentheses. These probabilities come from zero weights on all constraints, except “Iamb”, which wants the foot to be right-headed, and thus penalizes candidates 2 and 3. Here Iamb has weight 0.1.

1. batAma (batA)ma             0.2624896

2. batAma ba(tAma)             0.2375104

3. bAtama (bAta)ma             0.2375104

4. batamA ba(tamA)             0.2624896

The summed probability of rows 1. and 2. is 0.50, and thus this grammar meets our definition of success if the target language has medial stress. But no matter how high we increase the weight of Iamb, we will never get that sum to exceed 0.50 (another demonstration would have been just to have the weights at zero, since scaling will have no effect, and batAma will again have 0.50 probability). A correct grammar in the sense we would like also needs to include non-zero weight on a constraint that prefers 1. over 4 (e.g. Align-Left).
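The counter-example can be checked numerically with a minimal sketch like the one below. It assumes, as in the text, that Iamb is the only constraint with non-zero weight and that it assigns one violation each to candidates 2 and 3; the data layout and function names are illustrative. Algebraically, the summed probability of the two medial-stress parses is (1 + e^(-w)) / (2 + 2e^(-w)), which equals 0.5 for any Iamb weight w.

```python
import math

# (overt form, parse, number of Iamb violations)
CANDIDATES = [
    ("batAma", "(batA)ma", 0),
    ("batAma", "ba(tAma)", 1),
    ("bAtama", "(bAta)ma", 1),
    ("batamA", "ba(tamA)", 0),
]


def parse_probs(w_iamb: float) -> list:
    """MaxEnt probability of each full parse, given the weight of Iamb."""
    scores = [math.exp(-w_iamb * violations) for _, _, violations in CANDIDATES]
    z = sum(scores)  # normalizing constant over all four parses
    return [s / z for s in scores]


def overt_probs(w_iamb: float) -> dict:
    """Probability of each overt form, summed over its parses."""
    totals = {}
    for (overt, _, _), p in zip(CANDIDATES, parse_probs(w_iamb)):
        totals[overt] = totals.get(overt, 0.0) + p
    return totals
```

`parse_probs(0.1)` reproduces the four probabilities listed above, and `overt_probs(w)["batAma"]` stays at 0.5 whether `w` is 0, 0.1 or 50. As the weight grows, final-stressed batamA also approaches 0.5, so medial stress never becomes strictly more probable than every competitor.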

So what’s the right definition? One obvious possibility would be to require a single correct candidate to have the highest probability, which corresponds to a categorical version of HG (see this paper for some discussion of the relationship between categorical HG and MaxEnt), but that seems wrong given our objective function, which doesn’t have that structure (though see my comment on this post for more on this). Another would be to require some arbitrary amount of probability on the correct form, but we could construct another counter-example simply by making the set of parses that correspond to one overt form sufficiently large with respect to the others. It seems the right answer would involve knowing the conditions under which it is in fact true that scaling will bring probabilities arbitrarily close to 1, but I don’t know what they are when hidden structure is involved.