*From Joe Pater*

The commentaries on my paper “Generative Linguistics and Neural Networks at 60: Foundation, Friction and Fusion” are now all posted online at the authors’ websites, at the links below. The linked versions of my paper and, I presume, of the commentaries are the non-copyedited but otherwise final versions that will appear in the March 2019 volume of *Language*, in the Perspectives section.

I decided not to write a reply to the commentaries, since they nicely illustrate a range of possible responses to the target article, and because most of what I would have written in a reply would have been to repeat or elaborate on points that are already in my paper. But there is of course lots more to talk about, so I thought I’d set up this blog post with open comments to allow further relatively well-archived discussion to continue.

Iris Berent and Gary Marcus. No integration without structured representations: reply to Pater.

Ewan Dunbar. Generative grammar, neural networks, and the implementational mapping problem.

Tal Linzen. What can linguistics and deep learning contribute to each other?

Lisa Pearl. Fusion is great, and interpretable fusion could be exciting for theory generation.

Chris Potts. A case for deep learning in semantics.

Jonathan Rawski and Jeff Heinz. No Free Lunch in Linguistics or Machine Learning.

Re. Berent and Marcus: Is there a public write-up of “How we reason about innateness”? It’s cited as evidence for a fairly bold claim (“resistance to innate ideas could well be grounded in core cognition itself”), but all I can find is a talk with no associated paper.

Hopefully, soon. It’s under review (revise and resubmit).

Thanks! Curious to see if you’re comfortable sharing privately. (No worries if not.)

(Porting over from Twitter)

Either I don’t get the spirit of the Rawski & Heinz response, or it misses an easy opportunity to draw parallels between symbolic and representation learning-based approaches to language. They claim that “any serious scientific application of neural architectures within linguistics must always strive to make the learner’s biases transparent,” with the strong implication that this isn’t being done.

Specifying the model architecture and learning algorithm used for an NN model specifies the model’s bias, no? Nearly every paper in this literature gives that key information, and many, many CL papers discuss the consequences of the specifications for what is learnable easily and what is learnable at all. Very little of this discussion is couched in the language of learning theory and the Chomsky Hierarchy, but the discussion is absolutely happening. I see a clear opportunity for bridge-building here, but I don’t see a clear failing in current work unless you take that body of theory as a particularly privileged approach to the science of language.

(Of course, the original article could have commented on this too, but I don’t think it was clearly called for.)

Reposted from Twitter:

The Berent & @Marcus response summarizes familiar arguments against 1980s-style connectionism, but it would have been useful to see a discussion of more recent work. Examples include recent neural architectures with structured inductive biases (syntactic, relational, compositional, etc.) from Chris Dyer, Jacob Andreas, Richard Socher, and Sam Bowman, which I assume would not be taken to follow the “associationist hypothesis”, and Kirov and Cotterell’s experimental work showing that modern seq2seq networks (without explicit algebraic representations) can in fact learn the English past tense (https://arxiv.org/abs/1807.04783). The only recent papers that do get mentioned are ones that support the authors’ argument: @LakeBrenden & Baroni’s (very cool) experimental work that demonstrates lack of systematicity in standard seq2seq networks (https://arxiv.org/abs/1711.00350 and https://arxiv.org/abs/1807.07545).

In any case I agree with Berent & Marcus that (1) the goal is to create a model that generalizes like humans, and (2) to get there we need to run experiments on both models and humans, and if necessary add different/stronger inductive biases (nothing controversial here).

Thanks Tal! Just in case someone reading your post hasn’t read the paper on which Berent and Marcus were commenting, I should point out that I tried to have a balanced discussion there on the issue of whether explicit linguistic representations, including symbols, are needed in neural net models of language, and included Kirov and Cotterell as an example of an interesting recent result that suggests that current architectures can do more without variables than earlier ones could. It seems that, from Berent and Marcus’ perspective, I was leaning too far in the direction of endorsing symbol-free models. The only major thing missing from the commentaries, I think, is someone from the other side, arguing that I was being too optimistic about the need for explicit linguistic representations.

To summarize my main points: (1) current practice in the neural network world has moved beyond what Gary has termed eliminative connectionism, and many “deep learning” systems have components that could qualify as symbols, variables and compositional representations; (2) discussion of the abilities and limitations of neural networks should make reference to specific experimental results obtained on specific neural network architectures.

Pater presented a narrative about language-learning over the past sixty years, focusing on neural networks and generative grammar. Our point is that any such narrative which excludes, as Pater’s does, computational learning theory and mathematical theories of string (and tree and graph) languages misses the forest for the trees. This is especially true for the fusion Pater dreams of. The only way that will occur, as Dunbar argues so cogently, is by bridging the mapping gap, which requires the ability to reason about NNs at the right level of abstraction. This is exactly what computational and mathematical theories of languages and language-learning provide.

Bowman asks “Specifying the model architecture and learning algorithm used for an NN model specifies the model’s bias, no?” The answer is No. A replicable program does not entail that its bias is analyzable or transparent. Such a specification is a necessary, but not a sufficient, condition. No one would use a sorting algorithm in a software library if there were no proof of its correctness. The proof of correctness may be derived from a specification of the program, but the program itself is not sufficient to make clear what problem it is solving. Computer science is about problems and the algorithms that solve them reliably, correctly, and within stated resource bounds. Again, following Dunbar, what problem is the specified NN solving, regardless of the task?

We agree there is an opportunity for bridge building, and we pointed to recent work that makes these connections. We encourage more work of this sort.

Finally, Pater’s article is about *generative* linguistics and neural networks. Our p.4 clearly showed this body of theory is central to both, so it is privileged in this context. One can reject generative grammar, but then one is not talking about Pater’s paper or ours. Also, the body of theory we discuss is a mathematical theory of language and language-learning. So it is generally privileged in the science of language and language-learning in the same way mathematical analysis is privileged in any other scientific endeavor. Many topics it encompasses, such as logic, automata theory, and the learning of grammars expressed with logic and automata, will be with us for centuries to come.

Jon & Jeff

One thought related to some of the above. I am very sympathetic to the desire to have our neural nets be interpretable, and also to have analytic results about representability and learnability that could hook up with other results in mathematical and computational theory. But it doesn’t follow that we should dismiss, or ignore, research that does not meet those desiderata.

Say, for example, that we want a model of how humans learn and represent semi-regular morphophonology. I think it would be a mistake to ignore Kirov and Cotterell’s research, just because it doesn’t meet the above desiderata.