The Society for Computation in Linguistics has been launched with a call for papers at its inaugural meeting in January 2018. The deadline is August 1. Join the mailing list to stay informed on this and future events.
Deep learning (Le Cun et al. 2015: Nature) involves training neural networks with hidden layers, sometimes many levels deep. Frank Rosenblatt (1928-1971) is widely acknowledged as a pioneer in the training of neural networks, especially for his development of the perceptron update rule, a provably convergent procedure for training single-layer feedforward networks. He is less widely acknowledged for his pioneering work with other network architectures, including multi-layer perceptrons, and models with connections “backwards” through the layers, as in recurrent neural nets. A colleague of Rosenblatt’s who prefers to remain anonymous points out that his “C-system” may even be a precursor to deep learning with convolutional networks (see esp. Rosenblatt 1967). Research on a range of perceptron architectures was presented in his 1962 book Principles of Neurodynamics, which was widely read by his contemporaries, and also by the next generation of neural network pioneers, who published the groundbreaking research of the 1980s. A useful concise overview of the work that Rosenblatt and his research group did can be found in Nagy (1991) (see also Tappert 2017). Useful accounts of the broader historical context can be found in Nilsson (2010) and Olazaran (1993, 1996).
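As a point of reference, the single-layer update rule Rosenblatt is best known for can be sketched in a few lines. This is a standard modern rendering with {-1, +1} labels and toy data of my own invention, not Rosenblatt's original notation:

```python
import numpy as np

def perceptron_train(X, y, epochs=100):
    """Train a single-layer perceptron with the classic update rule.

    X: (n_samples, n_features) inputs; y: labels in {-1, +1}.
    If the data are linearly separable, the rule provably converges.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # misclassified (or on boundary)
                w += yi * xi                   # the perceptron update rule
                b += yi
                errors += 1
        if errors == 0:  # converged: every point correctly classified
            break
    return w, b

# Linearly separable toy data: an AND-like function of two inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = perceptron_train(X, y)
preds = np.sign(X @ w + b)  # [-1, -1, -1, 1]
```

The rule touches only a single layer of weights, which is why it cannot by itself train the multi-layer architectures Rosenblatt also explored.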
In interviews, Yann Le Cun has noted the influence of Rosenblatt’s work, so I was surprised to find no citation of Rosenblatt (1962) in the Nature deep learning paper – it cites only Rosenblatt (1957), which has only single-layer nets. I was even more surprised to find perceptrons classified as single-layer architectures in Goodfellow et al.’s (2016) deep learning text (pp. 14-15, 27). Rosenblatt clearly regarded the single-layer model as just one kind of perceptron. The lack of citation for his work with multi-layer perceptrons seems to be quite widespread. Marcus’ (2012) New Yorker piece on deep learning classifies perceptrons as single-layer only, as does Wang and Raj’s (2017) history of deep learning. My reading of the current machine learning literature, and discussion with researchers in that area, suggests that the term “perceptron” is often taken to mean a single-layer feedforward net.
I can think of three reasons that Rosenblatt’s work is sometimes not cited, and even miscited. The first is that Minsky and Papert’s (1969/1988) book is an analysis of single-layer perceptrons, and adopts the convention of referring to them simply as perceptrons. The second is that the perceptron update rule is widely used under that name, and it applies only to single-layer networks. The last is that Rosenblatt and his contemporaries were not very successful in their attempts at training multi-layer perceptrons. See Olazaran (1993, 1996) for in-depth discussion of the complicated and usually oversimplified history around the loss of interest in perceptrons in the later 1960s, and the subsequent development of backpropagation for the training of multi-layer nets and resurgence of interest in the 1980s.
As for my question about whether Rosenblatt invented deep learning, that would depend on how one defines deep learning, and what one means by invention in this context. Tappert (2017), a student of Rosenblatt’s, makes a compelling case for naming him the father of deep learning based on an examination of the types of perceptron he was exploring, and comparison with modern practice. In the end, I’m less concerned with what we should call Rosenblatt with respect to deep learning, and more concerned with his work on multi-layer perceptrons and other architectures being cited appropriately and accurately. As an outsider to this field, I may well be making mistakes myself, and I would welcome any corrections.
Update August 25, 2017: See Schmidhuber (2015) for an exhaustive technical history of Deep Learning. This is very useful, but it doesn’t look to me like he is appropriately citing Rosenblatt: see secs. 5.1 through 5.3 (as well as the refs. above, see Rosenblatt 1964 on the cat vision experiments).
Non-web available reference (ask me for a copy)
Olazaran, Mikel. 1993. A Sociological History of the Neural Network Controversy. Advances in Computers Vol. 37. Academic Press, Boston.
Tappert, Charles. 2017. Who is the father of deep learning? Slides from a presentation May 5, 2017 at Pace University, downloaded June 15 from the conference site.
The 10th NECPhon will take place at UMass Amherst on Saturday 9/24. The talks, breaks, and lunch will all take place in/around N400 in the Department of Linguistics, which is in the Integrative Learning Center (650 N. Pleasant St). It is the building directly north of the pond on the map here.
Parking is free on weekends at most university parking lots (all those not circled on the map as 24hr enforced). I would suggest lots 62, 63, or 64 for proximity to the department.
Please see below for the schedule.
11-11:30 Arrive & Welcome
1-2 Lunch (provided)
5:30 Business Meeting
Continuing the very useful discussion we’ve been having on the new conference on Computational and Mathematical Modeling in Linguistics (here and here), I’d like to invite further discussion of the choice to do short paper (6-8pp) submissions instead of the usual abstract submissions for linguistics conferences. We had a little bit of discussion of some pros and cons of this choice on the original post, mostly relating to the potential conflict/competition of publishing something there as opposed to ACL or CogSci. Kyle Rawlins recently raised a number of potential issues with us by email, and I’d like to relay some of these concerns and invite more general discussion of these and other considerations.
To summarize Kyle’s concerns (Kyle, please feel free to comment below to expand on or correct anything):
It’s clear that this would be a novel approach for linguistics and that this approach could potentially discourage participation of linguists, which is not our goal. So the other side of the equation is: is it worth it? What are the advantages of this approach, and would they outweigh these or other potential costs? I advocated for paper submission, hoping that peer review would improve the quality of the work presented at the conference and have the potential to elevate the status of the papers published there as well as the status of the conference itself. Could this status be elevated enough for these papers to count as short journal papers, on par with brief articles or squibs in journals, for purposes of tenure review and hiring in linguistics departments? And if not, how problematic is this?
What do you think about the relative risks and potential benefits of this approach? What other considerations are there?
In our previous post, we hosted a discussion about a new conference for linguists and cognitive scientists on computational and mathematical modeling. In this post I’d like to solicit comments and suggestions about possible names for the conference. Before I lay out some existing suggestions for commentary, I’d like to summarize the overall plan and goals for the conference that emerged from that discussion:
Highest Priority Goals
1) We need to attract the core constituents to this conference, especially the first meeting. The core constituents are linguists/cognitive scientists who rely on computational/mathematical approaches and are concerned with questions about the human language faculty.
2) The conference should be accessible and affordable to linguists, including students. (to repeat from earlier, this rules out co-locating with ACL)
3) The conference should have quality, peer-reviewed paper submissions. I see this as an important move for the field of linguistics in general, not just this conference. This does not rule out the possibility discussed in the comments above of also having submissions of other kinds, such as presentation-only submissions which have possibly appeared elsewhere.
4) We want the meeting to be sustainable long-term, with room to become a 2-3 day ‘go-to’ event in linguistics/cognitive science.
High Priority Goals
5) Ideally, the conference would alternate US-Europe every other year rather than being solely a US conference to be inclusive of the international community.
6) Ideally, the conference would be a welcoming/accessible place to linguists who want to learn more about computational/mathematical approaches but don’t (yet) do that sort of work themselves. One way to do this would be to introduce a half-day of workshops or tutorials to initiate the conference. I’m not necessarily proposing this for our first meeting, but something to keep in mind for later.
7) Avoiding Balkanization. As we set up specialized conferences, we may contribute to the balkanization of our field (e.g. we may pull computational work out of AMP). To some extent this balkanization is an inevitable consequence of the specialization that is occurring as linguistics grows, but if we can avoid it, so much the better.
8) Increasing the participation of underrepresented groups in computational linguistics.
Overall Tentative/Consensus Plan
1) The first meeting is to be tentatively held at UMass in Fall 2017 in conjunction with a one-time workshop on computational modeling of language (invited speakers, pending funding, include Jacob Andreas, Emily Bender, Sam Bowman, Chris Dyer, Jason Eisner, Bob Frank, Matt Goldrick, Sharon Goldwater, and Paul Smolensky). The exact schedule is unknown at this point, but tentatively the new conference may be on a Friday or a Thursday-Friday, with the workshop probably Saturday-Sunday.
2) The second meeting is scheduled to be in Paris in Fall/Winter 2018, organized by Giorgio Magri.
3) We will have a general discussion of hosting options for subsequent meetings at the first meeting at UMass. One prominent possibility is holding the third meeting in conjunction with the LSA annual meeting in New Orleans in Jan. 2020.
4) The current plan is still to have paper submissions, possibly published with the ACL anthology (though stay tuned for another post to discuss this further).
Ok, so on to the candidate names! I think the current favorite in offline discussions among us is “Computational and Mathematical Modeling in Linguistics” with the acronym CAMML or CAMMIL or maybe even CAMMLing or CAMMILing. What do you think? I like that it is clearly about linguistics, that it is inclusive of both computational and mathematical approaches, and that it has a cute and pronounceable acronym. Earlier variants had “Linguistic Theory (LT)” or “Theoretical Linguistics (TL)” in them (like CLINT, CALT, CAMLT, or CATL, etc). There is also the option to add “meeting” (M) or “annual meeting” (AM) or “Society” (S) or “conference” (C) somewhere (yielding things like CALM, AMCTL, SCATL, etc). I’m sure there are many other possibilities, but I will leave off here with my favorite: CAMML (or is it CAMMLing?).
A group of us have recently been discussing the possibility of a new conference on computational approaches to linguistics (group=Rajesh Bhatt, Brian Dillon, Gaja Jarosz, Giorgio Magri, Claire Moore-Cantwell, Joe Pater, Brian Smith, and Kristine Yu). We’ll provide some of the content of that discussion in a moment (we=Gaja and Joe), but the main question we’d like to get on the table is where the first meeting of that conference should be held. It’s so far agreed that it should be co-located with some other event to increase participation (at least for the first meeting), and the end of 2017 / beginning 2018 seems like the right time to do it. The ideas currently under discussion are:
We’d very much appreciate your thoughts on the location or the substance of the conference as comments below, or use this google form to give a non-public response.
The original idea was to start a computational phonology conference, inspired by the success of the informal meetings that we’ve had as the North East Computational Phonology Circle, and by the central place that computational work has in phonology these days. But Giorgio pointed out that a broader meeting might well be of interest, and we seem to have come to a consensus that he’s likely right. It doesn’t seem like there is a general venue for computational linguistics of the non-engineering-focused kind, though we are aware of successful workshops that have been held at the ACL and elsewhere (e.g. Sigmorphon, MOL, CMCL). These workshops are in fact also part of the inspiration for this; however, the conference we envision would be broader in scope and co-located with a major linguistics conference to attract as many linguists as possible, minimize costs, and minimize additional conference travel.
We still think that a core contingent might well be the computational phonologists, especially at first, so we still think co-locating it with AMP might make sense (plus NYC is a good location). We’ve also had suggestions that we might in some years co-locate with other conferences, like NELS – the location of future meetings is something we could discuss in person at the first one.
We also seem to have come to a current consensus that we’d like to have reviewed short papers in the CS / CogSci tradition. This is an extremely efficient way to get research out. The one worry that was expressed was that this may create a barrier to later journal publication, but at least two journals with explicit policies on this (Cognitive Science and Phonology) allow publication of elaborated versions of earlier published conference papers.
Please share this post or the tweet below!
A new conference for computational *linguists*? Where and when? https://t.co/U9Do3nM3cx
— CompPhon@UMass (@comphonumass) July 29, 2016
From an e-mail from Paul Smolensky, March 28, 2015. Even though he wasn’t doing phonology in the mid-1980s when he coined the term “Harmony Theory”, Paul had apparently taken a course on phonology with Jorge Hankamer and found vowel harmony fascinating.
“Harmony” in “Harmony Theory” arises from the fact that the Harmony function is a measure of *compatibility*; the particular word was inspired by vowel harmony, and by the letter ‘H’ which is used in physics for the Hamiltonian or energy function, which plays in statistical mechanics the same mathematical role that the Harmony function plays in Harmony theory: i.e., the function F such that prob(x) = k*exp(F(x)/T). (Although I took the liberty of changing the sign of the function; in physics, it’s p(x) = k*exp(-H(x)/T), in Harmony Theory, it’s p(x) = k*exp(H(x)/T). That’s because it drove me crazy working in statistical mechanics that that minus sign kept coming and going and coming and going from equation to equation, leading to countless errors; I just dispensed with it at the outset and cut all that nonsense off at the pass.)
I always thought the reason why the physicists had a minus sign in the exponential was that otherwise temperatures would have to be negative. But I guess you can push the negation into the Hamiltonian, which is perhaps what Paul did.
Yes, that’s just what I did. Instead of minimizing badness I switched to maximizing goodness. I’m just that kind of guy.
Probabilities are never greater than one, so log probabilities are always less than or equal to zero. So a negative log likelihood is always a positive quantity, and smaller negative log likelihood values are associated with more likely outcomes. So one way to understand the minus sign in the Gibbs-Boltzmann distribution is that it makes H(x) correspond to a negative log likelihood.
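This reading of the minus sign can be checked numerically. A minimal sketch with made-up energies (the configuration names and values are purely illustrative), showing that H(x)/T is a negative log probability up to the additive constant log Z:

```python
import math

# Toy configurations with made-up energies H(x); T = 1 for simplicity.
H = {"a": 0.5, "b": 1.0, "c": 2.0}
T = 1.0

Z = sum(math.exp(-h / T) for h in H.values())        # partition function
p = {x: math.exp(-h / T) / Z for x, h in H.items()}  # Gibbs-Boltzmann

# -log p(x) = H(x)/T + log Z: energy behaves as a negative log
# probability, shifted by the constant log Z.
for x in H:
    assert abs(-math.log(p[x]) - (H[x] / T + math.log(Z))) < 1e-9

# Smaller negative log likelihood (= lower energy) means a more
# likely outcome, exactly as described above.
assert p["a"] > p["b"] > p["c"]
```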
But I think one can give a more detailed explanation.
In a Gibbs-Boltzmann distribution p(x) = k*exp(–H(x)/T), H(x) is the energy of a configuration x.
Because energies H(x) are non-negative (which follows from the definition of energy?), and given a couple of other assumptions (e.g., that there are an infinite number of configurations and energies are unbounded — maybe other assumptions will do?), it follows that probability must decrease with energy, otherwise the inverse partition function k would not exist (i.e., the probability distribution p(x) would not sum to 1).
So if the minus sign were not there, the temperature T (which relates energy and probability) would need to be negative. There’s no mathematical reason why we couldn’t allow negative temperatures, but the minus sign makes the factor T in the formula correspond much closer with our conventional understanding of temperature.
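The normalization argument above can be written out compactly (a sketch in my own notation, using H(x) for energy as in the formula above):

```latex
% Partition function with the conventional minus sign, T > 0, H(x) >= 0:
\[
  Z \;=\; \sum_{x} e^{-H(x)/T},
  \qquad p(x) \;=\; \frac{1}{Z}\, e^{-H(x)/T} .
\]
% Over infinitely many configurations, Z can converge because the terms
% e^{-H(x)/T} -> 0 as H(x) -> infinity. Dropping the minus sign instead gives
\[
  \sum_{x} e^{+H(x)/T} \;\ge\; \sum_{x} 1 \;=\; \infty,
  \qquad\text{since } H(x) \ge 0,\; T > 0 \;\Rightarrow\; e^{H(x)/T} \ge 1,
\]
% so the distribution cannot be normalized; the only way to rescue a positive
% exponent is to take T < 0, pushing the sign into the temperature instead.
```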
In fact, I think it is amazing that the constant T in the Gibbs-Boltzmann formula denotes exactly the pre-statistical mechanics concept of temperature (well, absolute temperature in Kelvin). In many other domains there’s a complex relationship between a physical quantity and our perception of it; what is the chance of a simple linear relationship like this for temperature?
But perhaps it’s not a huge coincidence. Often our perceptual quantities are logarithmically related to physical quantities, so perhaps its no accident that T is inside the exp() rather than outside (where it would show up as an “exponential temperature” term). And the concept of temperature we had before Gibbs and Boltzmann wasn’t just a naive perception of warmth; there had been several centuries of careful empirical work on properties of gases, heat engines, etc., which presumably lead scientists to the right notion of temperature well before the Gibbs-Boltzmann relationship was discovered.
Here are some quick thoughts.

0. Energy E in physics is positive. That’s what forces the minus sign in p(x) \propto exp(-E(x)/T), as Mark observes. Assuming x ranges over an infinite state space, the probability distribution can only be normalized to sum to one if the terms exp(-E(x)/T) approach zero as x -> infinity, and if E(x) > 0 and T > 0, this can only happen if E(x) -> infinity as x -> infinity and we have the minus sign in the exponent.

1. Why is physical E > 0?

2. Perhaps the most fundamental property of E is that it is conserved: E(x(t)) = constant, as the state of an isolated physical system x(t) evolves in time t. From that point of view there’s no reason that E > 0; any constant value would do.

3. For a mechanical system, E = K + V, the sum of the kinetic energy K derived from the motion of the massive bodies in the system and the potential energy V. Given Newton’s second law, F = ma = m dv/dt, E is conserved when F = -grad V and K = mv^2/2: then dE/dt = d(mv(t)^2/2)/dt + dV(x(t))/dt = mv dv/dt + dx/dt . grad V = v(ma) + v(-F) = 0; that’s where the - sign in -grad V comes from. Everything in the equation E = K + V could be inverted, multiplied by -1, without change in the conservation law. But the commonsense meaning of “energy” is something that should increase with v, hence K = mv^2/2 rather than -mv^2/2.

4. Although K = mv^2/2 > 0, V is often negative. E.g., for the gravitational force centered at x = 0, F(x) = -GmM x/|x|^3 = -grad V if V(x) = -GmM/|x| < 0 (any constant c can be added to this definition of V without consequence; but even so, for sufficiently small x, V(x) < 0). Qualitatively: gravitational force is attractive, directed to the origin in this case, and this force is -grad V, so grad V must point away from the origin, so V must increase as x increases, i.e., must decrease as x decreases. V must fall as 1/|x| in order for F to fall as 1/|x|^2, so the decrease in V as x -> 0 must take V to minus infinity.

5. In the cognitive context, it’s not clear there’s anything corresponding to the kinetic energy of massive bodies. So it’s not clear there’s anything to fix a part of E to be positive; flipping E by multiplication by -1 doesn’t seem to violate any intuitions. Then, assuming we keep T > 0, we can (must) drop the - in p(x) \propto exp(-E(x)/T) = exp(H(x)/T) where we define Harmony as H = -E. Now the probability of x increases with H(x); lower H is avoided, hence higher H is “better”, hence the commonsense meaning of “Harmony” has the right polarity.
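Point 5 can be verified directly: negating the energy and dropping the minus sign leaves the distribution untouched. A minimal sketch with toy numbers of my own:

```python
import math

def normalize(weights):
    """Turn nonnegative weights into a probability distribution."""
    Z = sum(weights)
    return [w / Z for w in weights]

def boltzmann(E, T=1.0):
    """p(x) proportional to exp(-E(x)/T): the physics convention."""
    return normalize([math.exp(-e / T) for e in E])

def harmony(H, T=1.0):
    """p(x) proportional to exp(+H(x)/T): Smolensky's convention."""
    return normalize([math.exp(h / T) for h in H])

E = [0.0, 1.0, 3.0]  # toy energies for three configurations
H = [-e for e in E]  # Harmony is negated energy: H = -E

p_E = boltzmann(E)
p_H = harmony(H)
# The two conventions define exactly the same distribution.
assert all(abs(a - b) < 1e-12 for a, b in zip(p_E, p_H))

# Highest Harmony (= lowest energy) is the most probable state.
assert p_H[0] == max(p_H)
```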
E-mail from Mark Johnson March 27, 2016
Very nice! I was thinking about kinetic energy, but yes, potential energy (such as gravitational energy) is typically conceived as negative (I remember my high school physics class, where we thought of gravitational fields as “wells”). I never thought about how this is forced once kinetic energy is positive.
Continuing in this vein, there are a couple of other obvious questions once one thinks about the relationship between Harmony theory and exponential models in physics.
For example, does the temperature T have any cognitive interpretation? That is, is there some macroscopic property of a cognitive system that T represents?
More generally, in statistical mechanics the number (or more precisely, the density) of possible states or configurations varies as a function of their energy, and there are so many more higher-energy states than lower-energy ones that the typical or expected value of a physical quantity like pressure is not that of the more probable low-energy states, but is instead determined by the more numerous, less probable higher-energy states.
I’d be extremely interested to hear if Paul knows of any cases where this or something like it occurs in cognitive science. I’ve been looking for convincing cases ever since I got interested in Bayesian learning! The one case I know of has to do with “sparse Dirichlet priors”, and it’s not exactly overwhelming.
E-mail from Paul Smolensky, March 27, 2016
The absolute magnitude of T has no significance unless the absolute magnitude of H does, which I doubt. So I’d take Mark’s question about T to come down to something like: what’s the cognitive significance of T -> 0 or T -> infinity or T ~ O(1)?
And I’d look for answers in terms of the cognitive role of different types of inference. T -> 0 gives maximum-likelihood inference; T -> infinity gives uniform sampling; T ~ O(1) gives sampling from the distribution exp(H(x)). Mark, you’re in a better position to interpret the cognitive significance of such inference patterns.
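The three regimes can be illustrated with a toy Harmony function (numbers of my own invention): as T -> 0 the distribution concentrates on the maximum-Harmony state, and as T grows large it flattens toward uniform.

```python
import math

def dist(H, T):
    """Distribution proportional to exp(H(x)/T) at temperature T."""
    m = max(H)
    w = [math.exp((h - m) / T) for h in H]  # shift by max for stability
    Z = sum(w)
    return [wi / Z for wi in w]

H = [2.0, 1.0, 0.0]  # toy Harmony values; state 0 has maximum Harmony

cold = dist(H, T=0.01)   # T -> 0: (approximately) pick the best state
warm = dist(H, T=1.0)    # T ~ O(1): sample in proportion to exp(H(x))
hot = dist(H, T=1000.0)  # T -> infinity: (approximately) uniform

assert cold[0] > 0.999                          # all mass on the best state
assert all(abs(p - 1 / 3) < 0.01 for p in hot)  # near-uniform sampling
assert warm[0] > warm[1] > warm[2]              # graded by Harmony
```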
As for the question of density of states of different Harmony/energy, the (log) density of states is essentially the entropy, so any cognitive significance entropy may have — e.g., entropy reduction as predictor of incremental sentence processing difficulty à la Hale — qualifies as cognitive relevance of density of states. As for the average value of a quantity reflecting less-probable-but-more-numerous states more than more-probable states, I’m not sure what the cognitive significance of average values is in general.
“No data, yet combines two or more of the worst phonological theories, resulting in an account that is far more complicated and assumption-laden than the simple if typologically odd pseudo-example given.”
I received this review on an abstract I submitted recently. I’ve gotten plenty of bad reviews in the sense of them being negative, but I’ve never gotten one that was so unprofessional, and that made it so clear that the reviewer hadn’t engaged with the abstract in anything but the most superficial fashion. Because I didn’t think this reviewer was doing their job, I was moved to complain about it. I did so as follows:
“I’ve never complained about a conference review before, but this one’s beyond the pale. I don’t want you to do anything about it, but I had to tell you I’m pretty shocked by it.”
The conference organizer reported that the program committee agreed that the review was unprofessional, and that this reviewer, along with another who had engaged in “soapboxing or axe-grinding”, would not be included in the list of reviewers passed on to the next year’s organizer.
I was pleased with this outcome, and I thought I’d tell this story because this seemed like a good way of improving the quality of reviewer pools that others might usefully adopt. I’d also be happy if this contributed to a general discussion of what the expectations are for reviews, and how we can make them better.
The nearly final version of our Phonological Concept Learning paper, to appear in Cognitive Science, is now available here. The abstract is below, and we very much welcome further discussion, either by e-mail to the authors (addresses on the first page of the paper), or as comments to this post.
Linguistic and non-linguistic pattern learning have been studied separately, but we argue for a comparative approach. Analogous inductive problems arise in phonological and visual pattern learning. Evidence from three experiments shows that human learners can solve them in analogous ways, and that human performance in both cases can be captured by the same models.
We test GMECCS, an implementation of the Configural Cue Model (Gluck & Bower, 1988a) in a Maximum Entropy phonotactic-learning framework (Goldwater & Johnson, 2003; Hayes & Wilson, 2008) with a single free parameter, against the alternative hypothesis that learners seek featurally-simple algebraic rules (“rule-seeking”). We study the full typology of patterns introduced by Shepard, Hovland, and Jenkins (1961) (“SHJ”), instantiated as both phonotactic patterns and visual analogues, using unsupervised training.
Unlike SHJ, Experiments 1 and 2 found that both phonotactic and visual patterns that depended on fewer features could be more difficult than those that depended on more features, as predicted by GMECCS but not by rule-seeking. GMECCS also correctly predicted performance differences between stimulus subclasses within each pattern. A third experiment tried supervised training (which can facilitate rule-seeking in visual learning) to elicit simple-rule-seeking phonotactic learning, but cue-based behavior persisted.
We conclude that similar cue-based cognitive processes are available for phonological and visual concept learning, and hence that studying either kind of learning can lead to significant insights about the other.
On the Synthetic Learner blog, Emmanuel Dupoux recently posted some comments on a paper co-authored by Gaja Jarosz and Shira Calamaro that recently appeared in Cognitive Science. Gaja has also written a reply. While you are there, take a peek around the blog, and the Bootphon website: Dupoux has a big and very interesting project on unsupervised learning of words and phonological categories from the speech stream.