Deep learning (LeCun et al. 2015, Nature) involves training neural networks with hidden layers, sometimes many levels deep. Frank Rosenblatt (1928-1971) is widely acknowledged as a pioneer in the training of neural networks, especially for his development of the perceptron update rule, a provably convergent procedure for training single-layer feedforward networks. He is less widely acknowledged for his pioneering work with other network architectures, including multi-layer perceptrons and models with connections “backwards” through the layers, as in recurrent neural nets. A colleague of Rosenblatt’s who prefers to remain anonymous points out that his “C-system” may even be a precursor to deep learning with convolutional networks (see esp. Rosenblatt 1967). Research on a range of perceptron architectures was presented in his 1962 book Principles of Neurodynamics, which was widely read by his contemporaries and by the next generation of neural network pioneers, who published the groundbreaking research of the 1980s. A useful concise overview of the work that Rosenblatt and his research group did can be found in Nagy (1991) (see also Tappert 2017). Useful accounts of the broader historical context can be found in Nilsson (2010) and Olazaran (1993, 1996).
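For readers unfamiliar with it, here is a minimal sketch of the perceptron update rule in modern notation (mine, not Rosenblatt’s): on each misclassified example, nudge the weights toward the correct side of the decision boundary. The convergence guarantee holds only when the data are linearly separable.

```python
import numpy as np

def perceptron_train(X, y, epochs=100):
    """Perceptron update rule for a single-layer feedforward net.

    X: (n_samples, n_features) inputs; y: labels in {-1, +1}.
    On each mistake, update w <- w + y_i * x_i (and b <- b + y_i).
    Guaranteed to converge when the classes are linearly separable.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) <= 0:   # misclassified (or on the boundary)
                w += y_i * x_i
                b += y_i
                mistakes += 1
        if mistakes == 0:                  # a separating hyperplane was found
            break
    return w, b

# Learn logical AND, a linearly separable problem.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = perceptron_train(X, y)
preds = np.sign(X @ w + b)
```

Note that the rule only adjusts one layer of weights, which is exactly why it does not, by itself, extend to training deep nets.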
In interviews, Yann LeCun has noted the influence of Rosenblatt’s work, so I was surprised to find no citation of Rosenblatt (1962) in the Nature deep learning paper – it cites only Rosenblatt (1957), which covers only single-layer nets. I was even more surprised to find perceptrons classified as single-layer architectures in Goodfellow et al.’s (2016) deep learning text (pp. 14-15, 27). Rosenblatt clearly regarded the single-layer model as just one kind of perceptron. The lack of citation for his work with multi-layer perceptrons seems to be quite widespread: Marcus’s (2012) New Yorker piece on deep learning classifies perceptrons as single-layer only, as does Wang and Raj’s (2017) history of deep learning. My reading of the current machine learning literature, and discussion with researchers in that area, suggests that the term “perceptron” is often taken to mean a single-layer feedforward net.
I can think of three reasons that Rosenblatt’s work is sometimes not cited, and even miscited. The first is that Minsky and Papert’s (1969/1988) book is an analysis of single-layer perceptrons, and adopts the convention of referring to them simply as perceptrons. The second is that the perceptron update rule is widely used under that name, and it applies only to single-layer networks. The last is that Rosenblatt and his contemporaries were not very successful in their attempts at training multi-layer perceptrons. See Olazaran (1993, 1996) for in-depth discussion of the complicated and usually oversimplified history around the loss of interest in perceptrons in the later 1960s, and the subsequent development of backpropagation for the training of multilayer nets and resurgence of interest in the 1980s.
As for my question about whether Rosenblatt invented deep learning, that would depend on how one defines deep learning, and what one means by invention in this context. Tappert (2017), a student of Rosenblatt’s, makes a compelling case for naming him the father of deep learning based on an examination of the types of perceptron he was exploring, and comparison with modern practice. In the end, I’m less concerned with what we should call Rosenblatt with respect to deep learning, and more concerned with his work on multi-layer perceptrons and other architectures being cited appropriately and accurately. As an outsider to this field, I may well be making mistakes myself, and I would welcome any corrections.
Update August 25 2017: See Schmidhuber (2015) for an exhaustive technical history of Deep Learning. This is very useful, but it doesn’t look to me like he is appropriately citing Rosenblatt: see secs. 5.1 through 5.3. (As well as the refs. above, see Rosenblatt 1964 on the cat vision experiments.)
Non-web available reference (ask me for a copy)
Olazaran, Mikel. 1993. A Sociological History of the Neural Network Controversy. Advances in Computers Vol. 37. Academic Press, Boston.
Tappert, Charles. 2017. Who is the father of deep learning? Slides from a presentation May 5th 2017 at PACE University, downloaded June 15th from the conference site. (Update July 2021: this has now been published in a conference proceedings paper, and is cited in a much-improved wikipedia entry for Rosenblatt).
I just skimmed about 5 pages of the e-book (which, happily, is available for downloading at http://www.dtic.mil/docs/citations/AD0256582) and would definitely call at least some of what Rosenblatt did deep learning. There are nets w/ 2 hidden layers and modifiable connections…to not call that “deep learning”, irrespective of the error-correction/training algorithm used, just seems perverse to me.
Sidebar: it’s really only this year that I’m starting to see DL being good about citing work from the 80s and 90s.
Right. One way Deep Learning is sometimes defined is as using a multilayer perceptron with more than one hidden layer, so even with that definition, Rosenblatt was using deep nets.
Other ppl were thinking about this at the same time as you…(includes quote from and link to review by Block)
Rosenblatt invented deep networks (actually, elaborated on many ideas, including McCulloch and Pitts’s work), although the depth used in the 50s and 60s (and up to the 90s) would be considered “shallow” by today’s standards.
Unfortunately, he did not invent deep LEARNING. His learning algorithm (the perceptron algorithm) only works for shallow learning, namely 1-layer-deep networks. Rosenblatt’s perceptrons were deep, but only the last layer included learning.
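This “deep architecture, shallow learning” setup can be sketched in a modern caricature (my illustration, not Rosenblatt’s exact construction): a frozen random hidden layer stands in for the fixed A-unit connections, and only the output layer is trained by the error-correction rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_features(X, n_hidden=64):
    """Fixed, untrained hidden layer: random binary 'A-unit' responses."""
    W = rng.normal(size=(X.shape[1], n_hidden))   # frozen random weights
    return np.where(X @ W > 0, 1.0, 0.0), W

def train_output_layer(H, y, epochs=1000):
    """Perceptron (error-correction) training of the LAST layer only."""
    w, b = np.zeros(H.shape[1]), 0.0
    for _ in range(epochs):
        done = True
        for h, t in zip(H, y):
            if t * (w @ h + b) <= 0:
                w += t * h
                b += t
                done = False
        if done:
            break
    return w, b

# XOR is not linearly separable in the raw inputs, but it usually becomes
# separable in a large enough random feature space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])
H, W = random_features(X)
w, b = train_output_layer(H, y)
```

The architecture is “deep” in the sense of having a hidden layer, but all the learning happens in `w` and `b`; the hidden weights `W` never change.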
I think this is a nice summary of the facts. In retrospect, the title of my post might be misconceived: it should have been “Why doesn’t Frank Rosenblatt get the credit he deserves”, or something like that. I should say, though, that he certainly *did* do learning with multilayer perceptrons – he just didn’t succeed in finding a good algorithm for it.
I agree, Rosenblatt definitely wanted to use “deep” networks, he just had a hard time training them. I think there’s a reasonable argument that what distinguishes “deep learning” from earlier connectionist models is not the idea of using multiple hidden layers (which has been around essentially since the beginning), but rather the methods that allow efficient optimization (i.e. learning) in such networks.
Having just read it, it appears that the 1962 work in which Rosenblatt proposed “four layer” networks (i.e. input, output, and two hidden layers) didn’t actually do supervised learning. Instead, it used a more Hebbian learning rule to do unsupervised learning (it relied on examples being presented in an order such that examples of the same “class” appear in sequence; the network learns to associate examples presented in close temporal proximity).
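To make the temporal-proximity idea concrete, here is a loose illustrative sketch (emphatically NOT Rosenblatt’s actual 1962 rule; the update, learning rate, and architecture here are my own assumptions): inputs shown in close succession have their hidden representations pulled together by a Hebbian-style update, so same-class examples, presented as consecutive runs, end up with similar codes.

```python
import numpy as np

rng = np.random.default_rng(1)

def hebbian_temporal(X_sequence, n_hidden=16, lr=0.05, epochs=30):
    """Toy unsupervised rule: pull each input's hidden representation
    toward the representation of the input shown just before it."""
    d = X_sequence.shape[1]
    W = rng.normal(scale=0.1, size=(d, n_hidden))
    for _ in range(epochs):
        prev_h = None
        for x in X_sequence:
            h = np.tanh(x @ W)
            if prev_h is not None:
                # Hebbian-style co-activity update: move x's code toward
                # the previous time step's code.
                W += lr * np.outer(x, prev_h - h)
            prev_h = h
    return W

# Two "classes", each presented as a consecutive run of noisy variants,
# mimicking the presentation order the rule relies on.
base_a = np.array([1.0, 1.0, 0.0, 0.0])
base_b = np.array([0.0, 0.0, 1.0, 1.0])
seq = np.array([base_a + 0.1 * rng.normal(size=4) for _ in range(5)] +
               [base_b + 0.1 * rng.normal(size=4) for _ in range(5)])
W = hebbian_temporal(seq)
```

The sketch also exposes the weakness noted above: the single transition between the two runs gets associated just like within-class transitions, which is part of why the rule depends so heavily on presentation order.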
On top of issues of the quality and scalability of this algorithm, modern “deep learning” is largely focused on supervised learning problems (even if “unsupervised pre-training” is sometimes used along the way).
Still, it’s very important to be aware of this history; I suspect very few modern students of deep learning have actually read the early work to understand how much of the recent success comes down to massive improvements in vector-parallel compute hardware (i.e. GPUs), along with incremental improvements in architecture and learning rules.
The four-layer system was a step toward analyzing the “cross-coupled” system. It involved a time-dependent method of modifying layer-to-layer weights (for the interconnected two layers), which after training was turned off, but which during training tended to make various distortions and translations of the retinal stimuli produce nearly the same configurations of the associator layer, so that similar distortions and translations of any retinal stimulus would produce close-to-identical configurations of the associator layer. Thus when the associator-to-response weights were trained by the error-correction algorithm, the response-units would respond correctly to any of those distortions and translations of other stimuli. As Dave Block said, “you put the one-to-two layer training stimulus there for a LONG time.”
The simple (three-layer) perceptron was unlike a brain in that there were no associator-to-associator couplings that were time-dependent, so that what was happening after training was temporally independent and could be looked at. The four-layer system differed from it mainly in that it introduced time-dependent experience into the classification of stimuli, until the layer 1-to-layer 2 weights were fixed.
Most of the shortcomings and lacks of investigation in Rosenblatt’s expositions were related to the state of the computing art in the 1960s. The biggest perceptron that would fit into the memory of an IBM 7090 simulation, with, say, five excitatory and five inhibitory connections per A-unit, was about 3,000 A-units, with very little room left for the generating code (tightly-written code, of course). In practice, we more often had a thousand or fewer, so that all we could glean from them were pretty general principles. When we got to the CDC 6600, we were restricted to FORTRAN IV except for really time-consuming algorithms, which were generally coded in assembly language.
Yes, the modern technical meaning of “perceptron” in the ML community is “single-layer perceptron” — not Rosenblatt’s definition. That might be because the Perceptron Learning Theorem (1961) only deals with single-layer perceptrons. So “perceptrons” are taken to be the functions to which that theorem applies. But who started referring to it as the Perceptron Learning Theorem? Not Rosenblatt 1961; maybe Minsky & Papert?
Yes – Minsky and Papert limited the instances they studied to simple perceptrons, and I do believe they wound up just calling them Perceptrons. I think I came across something that suggested Block and Rosenblatt didn’t like that – maybe Block’s review of Minsky and Papert?
It is indeed Block’s 1970 review. The relevant portion is cited in the post that Fred Mailhot shared above (http://building-babylon.net/2017/06/08/minsky-paperts-perceptrons/):
The proper name for the training algorithm was the “Error-correction” procedure. The primary theorem of perceptron theory stated that if there is a solution to the classification problem being presented to the perceptron, training it by the Error-correction procedure will find it.
I agree with this; I recently read all of his work as research for my video series on deep learning. I found he came up with CNNs as well as backpropagation (he had the idea of tuning backwards through each layer, starting at the end, with ‘wiggles’). It’s not true that he only trained single-layer nets; it’s just that his method was slow with multiple layers because he was using a binary neuron.
I plan to put lots of his work in context in my series (see “the pattern machine” on YouTube).
Wow – thank you for this!
This article by Dreyfus and Dreyfus puts a little more historical context around it:
I think it’s a tragic story, especially given the fact that he died alone on his birthday soon after the events described in the article.
Thank you – I’m really enjoying this article’s description of the beginnings of the symbolic vs. statistical approaches to AI. I set it up as Chomsky vs. Rosenblatt in a recent article in Language, but Newell and Simon vs. Rosenblatt is clearly a better choice for general AI.