Author Archives: Joe Pater

Remembering Frank Rosenblatt

Frank Rosenblatt died tragically young 50 years ago, in a boating accident on his 43rd birthday, July 11, 1971. These reminiscences by Terry Koken, previously published only in the talk section of Rosenblatt’s Wikipedia entry, give some rare and valuable insight into his intellectual genius and his life.

Part 1 (originally published April 15, 2009)

I worked for Rosenblatt on the Cornell Cognitive Systems Research Program from June 1960 to January 1962, and again from January 1965 to June of 1968. He and I roomed together at 125 College Ave. during the first stint, and I later lived in his house on Middaugh Road in Brooktondale. In some ways I probably can claim responsibility for his buying the house.

I was an almost-lifelong astronomy buff, and thought I would eventually make it my profession. As a sixteen-year-old freshman at Cornell, I borrowed some paper from Rosenblatt in the Willard Straight Music room, and became acquainted with him. I must have in some way impressed him, as he stopped by the squalid quarters I was inhabiting in Syracuse shortly after my 18th birthday and offered me a job on his research program at Cornell, programming digital computers. It paid a little more than two bucks an hour, for that time an almost unheard-of wage for one of my tender years. I took it, of course.

Rosenblatt was a confirmed bachelor, and evidenced no visible interest in the opposite sex that I could discern. Whether he had any interest in the same sex is something none of us who worked for him ever speculated on; he was eccentric, and fun-loving, and had a hell of a good sense of humor, and such speculation just didn’t seem to matter. At any rate, my enthusiasm for astronomy must have been contagious, and at the same time he was looking for something to spend an accumulation of cash against, and he decided to buy a telescope. He chose a Fecker 12-1/2″ Cassegrain, complete with equatorial mount and drive, to the tune of about three grand worth, which was roughly nine months’ pay for me; an almost overwhelming amount. The instrument was custom-built, and it took some time to arrive (I think more than six months, but my recollection may be faulty). Realizing that College Town in Ithaca was a poor and improper place to put a telescope, he started house-hunting, and in just a few weeks had found and purchased a fine old brick house on twenty-five acres on Middaugh Road in Brooktondale, six or so miles east of Ithaca. Most of us who worked for him also moved in there, since the twenty-five thousand dollar price-tag on the real estate was a bit steep for him to manage without some people helping out by paying rent on rooms. We broke ground for the observatory in, as I recollect, summer of 1961. I must admit, I felt a little nervous over my role in this; I felt responsible for coaxing a friend, mentor, and employer into outlaying an overwhelming sum on one of my enthusiasms. But we dug, and poured foundations, and laid block, and built a pier just the same.

Some of the people involved were me, George Nagy, Steve (Steven Jon) King, Chuck Tappert, Dick Venezky, Trevor Barker, Dave Smith, our secretary Eirlys Tynan, her husband Mick, Dave Block, a math professor, and probably some others whose names I’ve lost in memory. Rosenblatt had the house painted white, and had a fireplace and chimney built, and bought a grand piano (he was an accomplished pianist, even though he could and did improvise endlessly on “Three Blind Mice”).

At some point in the construction of the observatory, the matter of telescope size came up in one of the not-infrequent bull sessions we would have after a day’s effort. I’m a little unsure of how we got there, but Rosenblatt boasted that he’d have at least a sixteen-inch scope on the premises within five years. The boast almost immediately became the subject of a wager: I bet him five bucks that this would not happen; he countered with an amendment that would give him a dollar an inch for anything over sixteen (I’d collect five dollars if I won, but would pay up to ten if I lost). We were, as I recollect, drinking beer at the time, and by and by I lost track of the details of the bet.

In 1962, I left staff at Cornell to return to university, this time at Syracuse. I endured a semester there, the assassination of a president, a sophomore slump, and in 1964 moved to Rochester and got married. My wife and I visited Rosenblatt in late 1964, and he made me another offer of employment, which I accepted. As 1965 came to a close, Rosenblatt remembered the bet, and, realizing he was about to lose it, got busy.

SETI was at that time something many were interested in. Most of the search at that time was being done at the 21-cm hydrogen wavelength. Townes’s ruby laser was, however, a very recent invention, and Rosenblatt reasoned that the search for extraterrestrial intelligence would be better done at visible wavelengths, where the very coherence of an electromagnetic wave would be a reasonable measure of intelligent origin of the signal. To this end, he designed a “Stellar Coherometer” which would ascertain whether emission in the spectra of stars was coherent radiation or not. He had connections with granting agencies, and was something (!) of a wheeler-dealer; he called up somebody he knew at NASA, I think it was, and, as it was near the end of their granting period and they had some unspent money in their vault, got told he could have seventy-five thousand dollars yankee for the project, which, by the way, included a twenty-inch telescope, to be housed on donated land at 119 Middaugh Road, Brooktondale, NY. Cornell would have to administer the grant, of course.

The Cornell administration took a look at it, and, horrified that the name of such a prestigious institution should be associated with such an outré project, distanced themselves as far as possible from it. Rosenblatt, never one to give up easily (he climbed Mount McKinley once!), went across the valley to Ithaca College, and got himself an appointment as an associate professor there, so that they could administer the grant. They would get a fine research observatory out of the deal, so it was much to their liking. To the entire project’s extreme disappointment, however, the appointment took just a little longer than it should have, so that the grant period for that fiscal year had elapsed by the time it came through. The seventy-five thousand, unspent, reverted to the U.S. Treasury and was never seen again.

Part 2 (Originally published November 3, 2009)

Recollections of Rosenblatt: The following are anecdotal, some hearsay, some as told by Rosenblatt himself to me, and are my recollections; thus they may be faulty or incomplete.

He did both his undergraduate and graduate studies at Cornell, which was at that time not a very common practice; but Cornell has one of the most scenic campuses in the world, and I suppose that may have been a factor. He was a practical joker, and maintained that any practical joke must be well-tailored to its recipient. He and some others, whose names I never learned, drove to the town of Gibson one fine night, and stole the town’s “Gibson” signs. They trucked them back to the Psychology department in Morrill Hall, and mounted them at the door of Professor Gibson’s office. When Robby MacLeod, department chairman when I was there, came in and saw them, he remarked to the department secretary, “Don’t you think Gibby’s getting a little ostentatious?” The signs, according to Rosenblatt, were eventually cut up for use in various projects around the department.

Rosenblatt built a digital computer for the Psych department, I believe while he was a grad student there (?). It was named EPAC, for Electronic Processing and Analyzing Computer, and it was remarked that the vacuum tubes in it would serve to make about a hundred and fifty record players. When I was there in 1958, only scraps of the machine remained, but I don’t think anyone made record players out of it. 

Rosenblatt was enthused about digital computing, and spent a lot of time in the Cornell Computing Center. He used the machinery for reducing data in various rat-psychology experiments, and gained vast experience thereby. As I recollect, the first machine he worked on was a “Card-Programmed Calculator”, a pretty primitive machine about which I know little more than its name and the fact that its memory and its program were the deck of cards you put into it. The IBM 650 that Cornell later got was a big step up for them. It was a two-address magnetic-drum memory machine, which could add two bi-quinary ten-digit numbers in, as I recollect, two 96-microsecond cycles. General opinion when the 650 arrived was that it was pretty lame, and that such a prestigious institution should have had an IBM 704 instead (this is a complaint passed on secondhand; I was not there until the university got their Burroughs 220), but for many years Cornell ran second-best in computing: the CPC when opinion said 650; the 650 when perhaps they should have had a 704; the Burroughs 220; instead of an IBM 7090, a CDC 1604, which was later replaced by an IBM System 360 Mod 67, whose multitasking operating system did not work for the first two years (as I recollect).

Rosenblatt’s researches involved simulation of neural networks on these machines. His feeling was that a Perceptron needed to have a large associator layer; something above a thousand A-units was necessary to get most of the brain-like characteristics that the beast seemed to promise. Such simulations involved generating instructions to access and sum random cells in the machine’s “retina”; you’d write a program that generated as many “ADD” instructions as the unit had excitatory connections, and as many “SUB” instructions as it had inhibitory ones, and the process was repeated for each associator unit, so that for a large Perceptron you got a generated program that pushed the boundaries of available magnetic-core memories. Cycle-time was very important, therefore; at five thousand operations per second (Burroughs 220) you could eat up lots of computer hours before getting anything useful out, but forty-thousand (IBM 704) would make much faster work of it.
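The generated-program scheme Koken describes can be sketched in modern terms. This is only an illustration, not a reconstruction of the original code: the retina size, the number of connections per A-unit, the threshold, and the function names `generate_unit_program` and `run_unit` are all assumptions chosen for the example. Only the two machine speeds come from the text above.

```python
import random

def generate_unit_program(retina_size, n_exc, n_inh, rng):
    """Generate the ADD/SUB instruction list for one associator (A-) unit.

    Each instruction names a randomly chosen retina cell: ADD for an
    excitatory connection, SUB for an inhibitory one.
    """
    cells = rng.sample(range(retina_size), n_exc + n_inh)
    return ([("ADD", c) for c in cells[:n_exc]] +
            [("SUB", c) for c in cells[n_exc:]])

def run_unit(program, retina, threshold):
    """Execute one unit's instructions and report whether the unit fires."""
    total = 0
    for op, cell in program:
        total += retina[cell] if op == "ADD" else -retina[cell]
    return total >= threshold

# Hypothetical sizes (assumptions, not figures from the text).
RETINA_SIZE, N_UNITS, N_EXC, N_INH, THRESHOLD = 400, 1000, 5, 5, 1

rng = random.Random(0)
programs = [generate_unit_program(RETINA_SIZE, N_EXC, N_INH, rng)
            for _ in range(N_UNITS)]
retina = [rng.randint(0, 1) for _ in range(RETINA_SIZE)]  # one binary stimulus
active = [run_unit(p, retina, THRESHOLD) for p in programs]

# Rough cost of presenting one stimulus, counting one operation per
# generated instruction and using the machine speeds quoted in the text.
n_instructions = sum(len(p) for p in programs)
for name, ops_per_sec in [("Burroughs 220", 5_000), ("IBM 704", 40_000)]:
    print(f"{name}: {n_instructions / ops_per_sec:.2f} s per stimulus")
```

Under these assumed sizes the generated program has ten thousand instructions, which gives a feel for why both memory capacity and cycle time mattered once the associator layer grew past a thousand units.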

Rosenblatt’s PhD was inked in, I believe, 1954 or so, so that he was dealing with pretty primitive machinery. Under such circumstances, one had to be very efficient about one’s code. Rosenblatt, in order to get time on the various machines, became pretty adept at this, and once told me that out of sheer desperation, he re-wrote and optimized a sociology student’s CPC program so that he could then take over the ninety percent of the time she was using that the rewrite saved her.

He was pretty good friends with Bruce W. (Tuck) Knight, Jr., a physicist as I recollect; Tuck had spent some time at Los Alamos and Eniwetok. Tuck was braumeister of the Collegetown Old Undershirt brewery, and usually had a batch in the crock, one aging in the bottle, and the currently drinking batch. The recipe was simple: Six gallons of water, one can of malt (Blue Ribbon Hop-flavored Malt syrup), five pounds of sugar, and a teaspoon of salt; add yeast and let it ferment for a week, skimming off the foam as it boiled up; bottle in quarts with a teaspoonful of bottle sugar to carbonate it. Occasionally you’d get a pretty potable batch, but mostly it was not what you’d call premium beer, its virtue being that you could get drunk on it. It took its name from the fact that you covered the crock with an old undershirt to keep flies and other undesirable elements out of the fermenting brew. Tuck had brewed beer on Eniwetok, and at Los Alamos. He collaborated with Rosenblatt and Dave Block on a seminal paper on four-layer perceptrons, which was the predecessor to Rosenblatt’s analysis of the cross-coupled system, and which promised some interesting results as far as pre-processing of stimuli went.

Just before I went to work for Rosenblatt the first time, I brewed a batch in Syracuse. Some time later I brought a quart of it down to Tuck’s apartment, and the three of us broached, decanted, and drank it, and Tuck pronounced it acceptable, but insufficient. However, the only brew available at the time had been bottled only two days before. I think we consumed two or three quarts of the green stuff, and I have vague recollection of drawing an enormous abstraction in pencil on Tuck’s kitchen wall before staggering home; the following morning’s suffering convinced me that consumption of such stuff was not a good idea, and taught me the real meaning of “hangover”. Tuck and Frank, however, seemed little affected. I went down to Tuck’s to view my artwork some time later, when I was feeling a little better, but Tuck had already sponged it off the paint. Ah, well.

Occasionally when bottling the brew one got distracted, and dumped a second teaspoon of sugar into the quart one was working on. When this happened, an explosion was the sometimes result, delayed enough that usually people were away from the brewery at work or elsewhere when the explosion happened. Tuck told of one instance when they’d come back and found brown glass embedded in a wall some forty feet from the bottles, and a landlady complaining of a “brown sticky fluid” dripping through her ceiling.

At one time shortly before my arrival, one Hong-Yee Chiu, a physicist, camped at Tuck’s place for a while. Hong-Yee built his own perceptron, a fairly small system; he had a pretty good sense of humor, and Rosenblatt later showed me some of the A-units from it: Hong-Yee had drawn faces on each of the plaster-potted A-units. He decided at one point that Old Undershirt was not potent enough, and took a couple of quarts up to the Physics department in Rockefeller Hall, and distilled it. I did not see or taste it, but was told that he brought back a Coke bottle full of a noxious substance that tasted like fire and was somewhat dangerous to drink. Many years later, I ran into Hong-Yee’s daughter on a Caltech website. I think she was entirely unaware of her dad’s stint at Cornell.

Readers should feel free to correct any faulty recollections they may encounter here, with my thanks. –Terrell E. Koken

Handouts on ties in HS

The handouts on ties by John McCarthy and Kathryn Pruitt linked from the OT-Help 2 manual on p. 11 are now available from these links (the links in the manual are dead):

Did Frank Rosenblatt invent deep learning in 1962?

Deep learning (LeCun et al. 2015: Nature) involves training neural networks with hidden layers, sometimes many levels deep. Frank Rosenblatt (1928-1971) is widely acknowledged as a pioneer in the training of neural networks, especially for his development of the perceptron update rule, a provably convergent procedure for training single-layer feedforward networks. He is less widely acknowledged for his pioneering work with other network architectures, including multi-layer perceptrons, and models with connections “backwards” through the layers, as in recurrent neural nets. A colleague of Rosenblatt’s who prefers to remain anonymous points out that his “C-system” may even be a precursor to deep learning with convolutional networks (see esp. Rosenblatt 1967). Research on a range of perceptron architectures was presented in his 1962 book Principles of Neurodynamics, which was widely read by his contemporaries, and also by the next generation of neural network pioneers, who published the groundbreaking research of the 1980s. A useful concise overview of the work that Rosenblatt and his research group did can be found in Nagy (1991) (see also Tappert 2017). Useful accounts of the broader historical context can be found in Nilsson (2010) and Olazaran (1993, 1996).
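For readers who have not seen it, here is a minimal sketch of the perceptron update rule in its common textbook form. It is not taken from Rosenblatt's own formulation: the ±1 label convention, the function name `train_perceptron`, and the AND data set are assumptions made for the illustration.

```python
def train_perceptron(samples, n_features, epochs=100):
    """Train a single-layer perceptron with the mistake-driven update rule.

    samples: list of (x, y) pairs, x a tuple of numbers, y in {-1, +1}.
    Returns the learned weights and bias.
    """
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:          # misclassified (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:                    # converged: every sample is correct
            break
    return w, b

# Logical AND is linearly separable, so the rule is guaranteed to converge.
AND = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), +1)]
w, b = train_perceptron(AND, n_features=2)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
               for x, _ in AND]
```

The weights change only when a sample is misclassified, which is what makes the convergence proof possible for separable data. It is also why the rule, as stated, applies to a single layer of adjustable weights.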

In interviews, Yann LeCun has noted the influence of Rosenblatt’s work, so I was surprised to find no citation of Rosenblatt (1962) in the Nature deep learning paper – it cites only Rosenblatt 1957, which has only single-layer nets. I was even more surprised to find perceptrons classified as single-layer architectures in Goodfellow et al.’s (2016) deep learning text (pp. 14-15, 27). Rosenblatt clearly regarded the single-layer model as just one kind of perceptron. The lack of citation for his work with multi-layer perceptrons seems to be quite widespread. Marcus’ (2012) New Yorker piece on deep learning classifies perceptrons as single-layer only, as does Wang and Raj’s (2017) history of deep learning. My reading of the current machine learning literature, and discussion with researchers in that area, suggests that the term “perceptron” is often taken to mean a single-layer feedforward net.

I can think of three reasons that Rosenblatt’s work is sometimes not cited, and even miscited. The first is that Minsky and Papert’s (1969/1988) book is an analysis of single-layer perceptrons, and adopts the convention of referring to them simply as perceptrons. The second is that the perceptron update rule is widely used under that name, and it applies only to single-layer networks. The last is that Rosenblatt and his contemporaries were not very successful in their attempts at training multi-layer perceptrons. See Olazaran (1993, 1996) for in-depth discussion of the complicated and usually oversimplified history around the loss of interest in perceptrons in the later 1960s, and the subsequent development of backpropagation for the training of multilayer nets and resurgence of interest in the 1980s.

As for my question about whether Rosenblatt invented deep learning, that would depend on how one defines deep learning, and what one means by invention in this context. Tappert (2017), a student of Rosenblatt’s, makes a compelling case for naming him the father of deep learning based on an examination of the types of perceptron he was exploring, and comparison with modern practice. In the end, I’m less concerned with what we should call Rosenblatt with respect to deep learning, and more concerned with his work on multi-layer perceptrons and other architectures being cited appropriately and accurately. As an outsider to this field, I may well be making mistakes myself, and I would welcome any corrections.

Update August 25 2017: See Schmidhuber (2015) for an exhaustive technical history of Deep Learning. This is very useful, but it doesn’t look to me like he is appropriately citing Rosenblatt: see secs. 5.1 through 5.3 (as well as the refs. above, see Rosenblatt 1964 on the cat vision experiments).

Non-web available reference (ask me for a copy)

Olazaran, Mikel. 1993. A Sociological History of the Neural Network Controversy. Advances in Computers Vol. 37. Academic Press, Boston.


Tappert, Charles. 2017. Who is the father of deep learning? Slides from a presentation May 5th 2017 at Pace University, downloaded June 15th from the conference site. (Update July 2021: this has now been published in a conference proceedings paper, and is cited in a much-improved Wikipedia entry for Rosenblatt).

Rosenblatt, with the image sensor of the Mark I Perceptron (Source: Arvin Calspan Advanced Technology Center; Hecht-Nielsen, R. Neurocomputing (Reading, Mass.: Addison-Wesley, 1990).)

The Mark I Perceptron (Source: Arvin Calspan Advanced Technology Center; Hecht-Nielsen, R. Neurocomputing (Reading, Mass.: Addison-Wesley, 1990).)

Conference on Computational Approaches to Linguistics?

A group of us have recently been discussing the possibility of a new conference on computational approaches to linguistics (group=Rajesh Bhatt, Brian Dillon, Gaja Jarosz, Giorgio Magri, Claire Moore-Cantwell, Joe Pater, Brian Smith, and Kristine Yu). We’ll provide some of the content of that discussion in a moment (we=Gaja and Joe), but the main question we’d like to get on the table is where the first meeting of that conference should be held. It’s so far agreed that it should be co-located with some other event to increase participation (at least for the first meeting), and the end of 2017 / beginning of 2018 seems like the right time to do it. The ideas currently under discussion are:

  1. In conjunction with the Annual Meeting on Phonology in New York in early fall 2017. (We haven’t approached the organizers about this).
  2. In conjunction with a one-time workshop on computational modeling of language planned for fall of 2017 at UMass (invited speakers, pending funding, include Jacob Andreas, Emily Bender, Sam Bowman, Chris Dyer, Jason Eisner, Bob Frank, Matt Goldrick, Sharon Goldwater, and Paul Smolensky).
  3. As a “Sister Society” at the LSA general meeting 4-7 January in Salt Lake City (we have had preliminary discussions with the LSA and this seems very straightforward).

We’d very much appreciate your thoughts on the location or the substance of the conference as comments below; you can also use this Google form to give a non-public response.

The original idea was to start a computational phonology conference, inspired by the success of the informal meetings that we’ve had as the North East Computational Phonology Circle, and by the central place that computational work has in phonology these days. But Giorgio pointed out that a broader meeting might well be of interest, and we seem to have come to a consensus that he’s likely right. It doesn’t seem like there is a general venue for computational linguistics of the non-engineering-focused kind, though we are aware of successful workshops that have been held at the ACL and elsewhere (e.g. Sigmorphon, MOL, CMCL). These workshops are in fact also part of the inspiration for this; however, the conference we envision would be broader in scope and co-located with a major linguistics conference to attract as many linguists as possible, minimize costs, and minimize additional conference travel.

We still think that a core contingent might well be the computational phonologists, especially at first, so we still think co-locating it with AMP might make sense (plus NYC is a good location). We’ve also had suggestions that we might in some years co-locate with other conferences, like NELS – the location of future meetings is something we could discuss in person at the first one.

We also seem to have come to a current consensus that we’d like to have reviewed short papers in the CS / CogSci tradition. This is an extremely efficient way to get research out. The one worry that was expressed was that this may create a barrier to later journal publication, but at least two journals with explicit policies on this (Cognitive Science and Phonology) allow publication of elaborated versions of earlier published conference papers.


What’s Harmony?

From an e-mail from Paul Smolensky, March 28, 2015. Even though he wasn’t doing phonology in the mid-1980s when he coined the term “Harmony Theory”, Paul had apparently taken a course on phonology with Jorge Hankamer and found vowel harmony fascinating.

“Harmony” in “Harmony Theory” arises from the fact that the Harmony function is a measure of *compatibility*; the particular word was inspired by vowel harmony, and by the letter ‘H’ which is used in physics for the Hamiltonian or energy function, which plays in statistical mechanics the same mathematical role that the Harmony function plays in Harmony Theory: i.e., the function F such that prob(x) = k*exp(F(x)/T).
(Although I took the liberty of changing the sign of the function; in physics, it’s p(x) = k*exp(-H(x)/T), in Harmony Theory, it’s p(x) = k*exp(H(x)/T). That’s because it drove me crazy working in statistical mechanics that that minus sign kept coming and going and coming and going from equation to equation, leading to countless errors; I just dispensed with it at the outset and cut all that nonsense off at the pass.)
From an e-mail from Mark Johnson Jan. 16th, 2016:
I always thought the reason why the physicists had a minus sign in the exponential was that otherwise temperatures would have to be negative.  But I guess you can push the negation into the Hamiltonian, which is perhaps what Paul did.
From an e-mail from Paul Smolensky, Feb. 10th, 2016:
Yes, that’s just what I did. Instead of minimizing badness I switched to maximizing goodness. I’m just that kind of guy.
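The sign change discussed in this exchange can be made concrete with a small sketch. The Harmony values and the function name `gibbs_distribution` are hypothetical, chosen only to show that maximizing Harmony H and minimizing energy E = -H describe the same distribution.

```python
import math

def gibbs_distribution(scores, T=1.0, sign=+1):
    """Return p(x) = k * exp(sign * score(x) / T) over a finite set of states.

    sign=+1 is the Harmony convention, sign=-1 the physics (energy) convention;
    k is the constant that makes the probabilities sum to one.
    """
    weights = [math.exp(sign * s / T) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

harmony = [0.0, -1.0, -2.5]        # hypothetical Harmony values for three states
energy = [-h for h in harmony]     # the same states described as energies, E = -H

p_harmony = gibbs_distribution(harmony, sign=+1)   # p(x) = k*exp(H(x)/T)
p_energy = gibbs_distribution(energy, sign=-1)     # p(x) = k*exp(-E(x)/T)
```

The two lists of probabilities agree state by state, and the state with the highest Harmony (equivalently, the lowest energy) is the most probable.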
From an e-mail from Mark Johnson, Feb. 10th, 2016:

Probabilities are never greater than one, so log probabilities are always less than or equal to zero.  So a negative log likelihood is always a positive quantity, and smaller negative log likelihood values are associated with more likely outcomes.  So one way to understand the minus sign in the Gibbs-Boltzmann distribution is that it makes H(x) correspond to a negative log likelihood.

But I think one can give a more detailed explanation.

In a Gibbs-Boltzmann distribution p(x) = k*exp(–H(x)/T), H(x) is the energy of a configuration x.

Because energies H(x) are non-negative (which follows from the definition of energy?), and given a couple of other assumptions (e.g., that there are an infinite number of configurations and energies are unbounded — maybe other assumptions will do?), it follows that probability must decrease with energy, otherwise the inverse partition function k would not exist (i.e., the probability distribution p(x) would not sum to 1).

So if the minus sign were not there, the temperature T (which relates energy and probability) would need to be negative.  There’s no mathematical reason why we couldn’t allow negative temperatures, but the minus sign makes the factor T in the formula correspond much closer with our conventional understanding of temperature.

In fact, I think it is amazing that the constant T in the Gibbs-Boltzmann formula denotes exactly the pre-statistical mechanics concept of temperature (well, absolute temperature in Kelvin).  In many other domains there’s a complex relationship between a physical quantity and our perception of it; what is the chance of a simple linear relationship like this for temperature?

But perhaps it’s not a huge coincidence.  Often our perceptual quantities are logarithmically related to physical quantities, so perhaps it’s no accident that T is inside the exp() rather than outside (where it would show up as an “exponential temperature” term).  And the concept of temperature we had before Gibbs and Boltzmann wasn’t just a naive perception of warmth; there had been several centuries of careful empirical work on properties of gases, heat engines, etc., which presumably led scientists to the right notion of temperature well before the Gibbs-Boltzmann relationship was discovered.
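The normalization argument in Mark's e-mail can be illustrated numerically. The sketch assumes a toy system whose states have energies 0, 1, 2, ... (non-negative and unbounded); the energy ladder and the function name `partial_sum` are assumptions for the example.

```python
import math

def partial_sum(sign, T, n_states):
    """Sum exp(sign * E / T) over the first n_states energies E = 0, 1, 2, ..."""
    return sum(math.exp(sign * n / T) for n in range(n_states))

T = 1.0
sizes = (10, 100, 500)

# With the minus sign the sum converges, so the constant k = 1/sum exists.
with_minus = [partial_sum(-1, T, n) for n in sizes]
limit = 1.0 / (1.0 - math.exp(-1.0 / T))     # closed form of the geometric series

# Without it the sum grows without bound, so no normalizing constant exists.
without_minus = [partial_sum(+1, T, n) for n in sizes]

# Dropping the minus sign but allowing a negative temperature restores convergence.
negative_T = partial_sum(+1, -T, 500)
```

In this toy case, dropping the minus sign while keeping T positive leaves nothing to normalize, whereas a negative T gives back exactly the convergent sum.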

From an e-mail from Paul Smolensky March 27, 2016:
Here are some quick thoughts.
0. Energy E in physics is positive. That’s what forces the minus sign in p(x) \propto exp(-E(x)/T), as Mark observes.

Assuming x ranges over an infinite state space, the probability distribution can only be normalized to sum to one if the exponential approaches zero as x -> infinity, and if E(x) > 0 and T > 0, this can only happen if E(x) -> infinity as x -> infinity and we have the minus sign in the exponent.

1. Why is physical E > 0?

2. Perhaps the most fundamental property of E is that it is conserved: E(x(t)) = constant, as the state of an isolated physical system x(t) evolves in time t. From that point of view there’s no reason that E > 0; any constant value would do.

3. For a mechanical system, E = K + V, the sum of the kinetic energy K derived from the motion of the massive bodies in the system and the potential energy V. Given Newton’s second law, F = ma = m dv/dt, E is conserved when F = -grad V and K = mv^2/2:

dE/dt = d(mv(t)^2/2)/dt + dV(x(t))/dt = mv dv/dt + dx/dt . grad V = v(ma) + v(-F) = 0; that’s where the minus sign in -grad V comes from.

Everything in the equation E = K + V could be inverted, multiplied by -1, without change in the conservation law. But the commonsense meaning of “energy” is something that should increase with v, hence K = mv^2/2 rather than -mv^2/2.

4. Although K = mv^2/2 > 0, V is often negative.

E.g., for the gravitational force centered at x = 0, F(x) = -GmM x/|x|^3 = -grad V if V(x) = -GmM/|x| < 0
(any constant c can be added to this definition of V without consequence; but even so, for sufficiently small x, V(x) < 0)
Qualitatively: gravitational force is attractive, directed to the origin in this case, and this force is -grad V, so grad V must point away from the origin, so V must increase as x increases, i.e., must decrease as x decreases. V must fall as 1/|x| in order for F to fall as 1/|x|^2 so the decrease in V as x -> 0 must take V to minus infinity.

5. In the cognitive context, it’s not clear there’s anything corresponding to the kinetic energy of massive bodies. So it’s not clear there’s anything to fix a part of E to be positive; flipping E by multiplication by -1 doesn’t seem to violate any intuitions. Then, assuming we keep T > 0, we can (must) drop the minus sign in p(x) \propto exp(-E(x)/T) = exp(H(x)/T), where we define Harmony as H = -E. Now the probability of x increases with H(x); lower H is avoided, hence higher H is “better”, hence the commonsense meaning of “Harmony” has the right polarity.
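The conservation argument in point 3 can be checked numerically. The sketch below is an illustration under stated assumptions, not part of Paul's e-mail: it uses a one-dimensional harmonic potential V(x) = k x^2 / 2 (so F = -dV/dx = -k x), a standard velocity-Verlet integrator, and made-up parameter values.

```python
def force(x, k):
    """F = -dV/dx for the assumed potential V(x) = k * x**2 / 2."""
    return -k * x

def simulate(x, v, m=1.0, k=1.0, dt=1e-3, steps=10_000):
    """Integrate m dv/dt = F(x) with velocity Verlet, recording E = K + V."""
    energies = []
    a = force(x, k) / m
    for _ in range(steps):
        kinetic = 0.5 * m * v * v
        potential = 0.5 * k * x * x
        energies.append(kinetic + potential)
        x += v * dt + 0.5 * a * dt * dt      # position update
        a_new = force(x, k) / m
        v += 0.5 * (a + a_new) * dt          # velocity update with averaged force
        a = a_new
    return energies

energies = simulate(x=1.0, v=0.0)
drift = max(energies) - min(energies)        # small: E is (numerically) conserved
harmonies = [-e for e in energies]           # H = -E is conserved just as well
```

The recorded energy stays constant up to the small discretization error of the integrator. Negating every value leaves it equally constant, which is the sense in which the overall sign of E is a convention.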

E-mail from Mark Johnson March 27, 2016

Very nice!  I was thinking about kinetic energy, but yes, potential energy (such as gravitational energy) is typically conceived as negative (I remember my high school physics class, where we thought of gravitational fields as “wells”).  I never thought about how this is forced once kinetic energy is positive.

Continuing in this vein, there are a couple of other obvious questions once one thinks about the relationship between Harmony theory and exponential models in physics.

For example, does the temperature T have any cognitive interpretation?  That is, is there some macroscopic property of a cognitive system that T represents?

More generally, in statistical mechanics the number (or more precisely, the density) of possible states or configurations varies as a function of their energy, and there are so many more higher energy states than lower energy ones that the typical or expected value of a physical quantity like pressure is not that of the more probable low energy states, but instead determined by the more numerous, less probable higher energy states.

I’d be extremely interested to hear if Paul knows of any cases where this or something like it occurs in cognitive science.  I’ve been looking for convincing cases ever since I got interested in Bayesian learning!  The one case I know of has to do with “sparse Dirichlet priors”, and it’s not exactly overwhelming.
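The density-of-states effect Mark describes can be seen in a toy system. This is a sketch under assumptions of my own choosing: 100 independent two-state units, with the energy of a configuration equal to the number of units that are "on", and T = 1.

```python
import math

N, T = 100, 1.0   # hypothetical system size and temperature

# Any single configuration with energy k has probability proportional to
# exp(-k/T), but there are C(N, k) distinct configurations at that energy.
level_weight = [math.comb(N, k) * math.exp(-k / T) for k in range(N + 1)]
Z = sum(level_weight)
level_prob = [w / Z for w in level_weight]

# The single most probable configuration is the all-off state (energy 0) ...
prob_of_ground_state = level_prob[0]

# ... yet the most probable energy level, and the mean energy, sit far above 0,
# because the higher levels contain vastly more configurations.
most_probable_level = max(range(N + 1), key=level_prob.__getitem__)
mean_energy = sum(k * p for k, p in enumerate(level_prob))
```

Here the lowest-energy configuration is individually the likeliest but carries a negligible share of the total probability. The expected energy works out to N / (1 + e), roughly 27, and is driven entirely by the more numerous higher-energy configurations.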

E-mail from Paul Smolensky, March 27, 2016

The absolute magnitude of T has no significance unless the absolute magnitude of H does, which I doubt. So I’d take Mark’s question about T to come down to something like: what’s the cognitive significance of T -> 0 or T -> infinity or T ~ O(1)?

And I’d look for answers in terms of the cognitive role of different types of inference. T -> 0 gives maximum-likelihood inference; T -> infinity gives uniform sampling; T ~ O(1) gives sampling from the distribution exp(H(x)).  Mark, you’re in a better position to interpret the cognitive significance of such inference patterns.

As for the question of density of states of different Harmony/energy, the (log) density of states is essentially the entropy, so any cognitive significance entropy may have — e.g., entropy reduction as predictor of incremental sentence processing difficulty à la Hale — qualifies as cognitive relevance of density of states. As for the average value of a quantity reflecting less-probable-but-more-numerous states more than more-probable states, I’m not sure what the cognitive significance of average values is in general.
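The three temperature regimes Paul mentions above can be illustrated with a few lines of code. The Harmony values, the particular temperatures, and the function name `harmony_distribution` are assumptions made for the example.

```python
import math

def harmony_distribution(harmonies, T):
    """Return p(x) proportional to exp(H(x)/T) over a finite set of states."""
    top = max(harmonies)   # subtracting the max avoids overflow; it cancels out
    weights = [math.exp((h - top) / T) for h in harmonies]
    z = sum(weights)
    return [w / z for w in weights]

H = [2.0, 1.0, 0.0]                       # hypothetical Harmony values

cold = harmony_distribution(H, 0.01)      # T -> 0: mass piles onto the best state
warm = harmony_distribution(H, 1.0)       # T ~ O(1): sampling from exp(H(x))
hot = harmony_distribution(H, 1000.0)     # T -> infinity: close to uniform
```

At low T nearly all the probability goes to the highest-Harmony state, at T = 1 the probabilities of neighbouring states differ by a factor of e per unit of Harmony, and at high T the three states become almost equally likely.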

Worst abstract review ever

“No data, yet combines two or more of the worst phonological theories, resulting in an account that is far more complicated and assumption-laden than the simple if typologically odd pseudo-example given.”

I received this review on an abstract I submitted recently. I’ve gotten plenty of bad reviews in the sense of them being negative, but I’ve never gotten one that was so unprofessional, and that made it so clear that the reviewer hadn’t engaged with the abstract in anything but the most superficial fashion. Because I didn’t think this reviewer was doing their job, I was moved to complain about it. I did so as follows:

“I’ve never complained about a conference review before, but this one’s beyond the pale. I don’t want you to do anything about it, but I had to tell you I’m pretty shocked by it.”

The conference organizer reported that the program committee agreed that the review was unprofessional, and that this reviewer, along with another who had engaged in “soapboxing or axe-grinding”, would not be included in the list of reviewers passed on to the next year’s organizer.

I was pleased with this outcome, and I thought I’d tell this story because this seemed like a good way of improving the quality of reviewer pools that others might usefully adopt. I’d also be happy if this contributed to a general discussion of what the expectations are for reviews, and how we can make them better.

Moreton, Pater and Pertsova in Cognitive Science

The nearly final version of our Phonological Concept Learning paper, to appear in Cognitive Science, is now available here. The abstract is below, and we very much welcome further discussion, either by e-mail to the authors (addresses on the first page of the paper), or as comments to this post.


Linguistic and non-linguistic pattern learning have been studied separately, but we argue for a comparative approach. Analogous inductive problems arise in phonological and visual pattern learning. Evidence from three experiments shows that human learners can solve them in analogous ways, and that human performance in both cases can be captured by the same models.

We test GMECCS, an implementation of the Configural Cue Model (Gluck & Bower, 1988a) in a Maximum Entropy phonotactic-learning framework (Goldwater & Johnson, 2003; Hayes & Wilson, 2008) with a single free parameter, against the alternative hypothesis that learners seek featurally-simple algebraic rules (“rule-seeking”). We study the full typology of patterns introduced by Shepard, Hovland, and Jenkins (1961) (“SHJ”), instantiated as both phonotactic patterns and visual analogues, using unsupervised training.

Unlike SHJ, Experiments 1 and 2 found that both phonotactic and visual patterns that depended on fewer features could be more difficult than those that depended on more features, as predicted by GMECCS but not by rule-seeking. GMECCS also correctly predicted performance differences between stimulus subclasses within each pattern. A third experiment tried supervised training (which can facilitate rule-seeking in visual learning) to elicit simple-rule-seeking phonotactic learning, but cue-based behavior persisted.

We conclude that similar cue-based cognitive processes are available for phonological and visual concept learning, and hence that studying either kind of learning can lead to significant insights about the other.

Calamaro and Jarosz on Synthetic Learner blog

On the Synthetic Learner blog, Emmanuel Dupoux recently posted some comments on a paper co-authored by Gaja Jarosz and Shira Calamaro that recently appeared in Cognitive Science. Gaja has also written a reply. While you are there, take a peek around the blog, and the Bootphon website: Dupoux has a big and very interesting project on unsupervised learning of words and phonological categories from the speech stream.

Wellformedness = probability?

There are some old arguments against probabilistic models as models of language, but these do not seem to have much force anymore, especially because we now have models that can compute probabilities over the same representations that we use in generative linguistics (Andries Coetzee and I have an overview of probabilistic models of phonology in our Handbook chapter; Mark Johnson has a nice explanation of the development of MaxEnt models and how they differ from PCFGs, as well as other useful material on probabilistic models as models of language learning; Steve Abney has a provocative and useful piece about how the goals of statistical computational linguistics can be seen as the goals of generative linguistics; see more broadly the recent debate between Chomsky and Peter Norvig on probabilistic approaches to AI; see also the Probabilistic Linguistics book and Charles Yang’s review).

That’s not to say that there can’t be issues in formalizing probabilistic models of language. In a paper to appear in Phonology (available here), Robert Daland discusses issues that can arise in defining a probability distribution over the infinite set of possible words, in particular with Hayes and Wilson’s (2008) MaxEnt phonotactic grammar model. In the general case, for this to succeed, the probability of strings must decrease sharply enough with increasing length that the sum of their probabilities converges: it never exceeds 1, and simply continues to approach it as longer strings are added. Daland defines the conditions under which this will obtain in the Hayes and Wilson model in terms of the requirements on the weight of a *Struc constraint that assigns a penalty that increases as string length increases.
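The flavor of the convergence requirement can be seen in a deliberately simplified sketch. It is not Daland's actual analysis: it assumes that *Struc is the only constraint, that it is violated once per segment, and that there is a hypothetical inventory of `k` segments. Under those assumptions each of the k^n strings of length n gets unnormalized score exp(-w·n), so the total over lengths is a geometric series with ratio k·exp(-w), which converges only when w > ln k.

```python
import math

def partial_mass(alphabet_size, struc_weight, max_len):
    """Total unnormalized MaxEnt score of all strings up to max_len.

    Assumes *Struc is the only constraint and is violated once per segment,
    so each of the alphabet_size**n strings of length n scores exp(-w * n).
    """
    ratio = alphabet_size * math.exp(-struc_weight)
    return sum(ratio ** n for n in range(max_len + 1))

k = 20  # hypothetical segment inventory size
threshold = math.log(k)  # critical *Struc weight under these assumptions

for w in (threshold - 0.1, threshold + 0.1):
    totals = [partial_mass(k, w, n) for n in (10, 100, 1000)]
    print(f"w = {w:.3f}: {[f'{t:.4g}' for t in totals]}")
```

Just below the threshold the total keeps growing as longer strings are admitted, so no normalization is possible; just above it the total levels off at 1 / (1 - k·exp(-w)), and dividing by that constant gives a proper distribution.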

In the question period after Robert’s presentation of this material at the GLOW computational phonology workshop in Paris in April, Jeff Heinz raised an objection to the general notion of formalizing well-formedness in terms of probabilities, and he repeated this argument at the Manchester fringe workshop last week. Here’s my reconstruction of it (hopefully Jeff will correct me if I get it wrong – I also don’t have the references to the earlier work that made this argument). Take a (relatively) ill-formed short string. Give it some probability. Now take a (relatively) well-formed string. Give it some probability. Now concatenate the well-formed string with itself enough times that the whole thing has probability lower than the ill-formed string, which it eventually will.
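The arithmetic behind the argument is easy to sketch, under the simplifying assumption (mine, not necessarily Jeff's) that the probability of a concatenation is the product of the probabilities of its parts. The function name and the two probabilities are hypothetical.

```python
def copies_needed(p_well, p_ill):
    """Smallest n with p_well**n < p_ill, assuming P(concatenation) = product."""
    if not (0.0 < p_well < 1.0 and 0.0 < p_ill < 1.0):
        raise ValueError("both probabilities must lie strictly between 0 and 1")
    n = 1
    while p_well ** n >= p_ill:
        n += 1
    return n

# Hypothetical values: a well-formed string at 1e-3, an ill-formed one at 1e-8.
print(copies_needed(p_well=1e-3, p_ill=1e-8))  # prints 3
```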

This is meant to be a paradox for the view that we can formalize well-formedness in terms of probabilities: the long well-formed string has probability lower than the short ill-formed string. It’s not clear to me, however, that there is a problem (and it wasn’t clear to Robert Daland either – the question period discussion lasted well into lunch, with Ewan Dunbar taking up Jeff’s position at our end of the table). Notice that Jeff’s argument makes an empirical claim: that concatenating well-formed strings does not decrease well-formedness. When I talked to him last week, he claimed that this is clearer in syntax than in phonology. Robert’s position (which I agree with) is that concatenation likely does decrease well-formedness – though from his review of the literature on phonotactic well-formedness judgments we don’t seem to have empirical data on this point.

Robert asked us to work with him in designing the experiment, and at the time I wasn’t sure that this was the best use of our lunch time, but I think he has a point. If this is in fact an empirical issue, and we can agree beforehand on how to test it, then this would save a lot of time compared with the usual process of the advocates of one position designing an experiment, which even if it turns out the way they hope, can then be criticized by the advocates of the other position as not having operationalized their claim properly, and so on…

It’s also of course possible that this is not an empirical issue: that there is a concept of perfect well-formedness that probabilistic models cannot capture. This reminds me of a comment I once got from a prominent syntactician on a talk in which I discussed probabilistic models that can give probability vanishingly close to zero to ill-formed structures: “but there are sentences that I judge as completely out for English – they should have probability zero”. My response was to simply repeat the phrase “vanishingly close to zero”, and check to make sure he knew what I meant.