New NSF Grant on Computational Phonology to Jarosz and Pater

Joe Pater (PI) and Gaja Jarosz (co-PI) have been awarded an NSF research grant on “Representing and learning stress: Grammatical constraints and neural networks” (NSF 2140826, $386,226). This three-year research grant will study the learnability of a wide range of word stress patterns, using two general approaches. In one, general-purpose learning algorithms will be employed with representational hypotheses developed in linguistics. The goal will be to develop grammar+learning systems that can cope with a broader range of typological data than current models, and that can also handle more of the details of individual languages while learning from more realistic data. In the other, neural networks, which lack prespecified linguistic structure, will be tested on their ability to learn these same patterns and to generalize appropriately. The public summary is below.

Public summary: Languages are systems of remarkable complexity, and linguists and computer scientists have devoted considerable effort to the development of methods for representing those complex systems, as well as computational methods for learning the system of a given language. This effort is driven by the desire to better understand human cognition and to build better language technologies. This project draws on the theories and methods of both linguistics and computer science to study the learning of word stress, the pattern of relative prominence of the syllables in a word. The stress systems of the world’s languages are relatively well described, and there are competing linguistic theories of how they are represented. This project applies learning methods from computer science to find new evidence to distinguish the competing linguistic theories. It also examines systems of language representation that have been developed in computer science and have received relatively little attention from linguists (neural networks). The research will engage undergraduate and graduate linguistics students at a public university. Linguistics has a much higher proportion of female students than computer science, and this project aims to address gender imbalance in STEM.

From a linguistic perspective, learning stress involves learning hidden structure: parts of the representation that are not present in the observed data and that must be inferred by the learner. A given pattern of prominence over syllables is often consistent with multiple prosodic representations. The approach to hidden structure learning used in this project applies the general technique of Expectation Maximization, which in pilot work achieved good results on a standard test set. Intriguingly, many of the languages that this learner failed on in the test set are ones that are in fact cross-linguistically unattested. This project expands the set of tested languages to include more of the range of systems found cross-linguistically, and further explores the possibility that typological gaps have learning explanations. It evaluates hypotheses about the constraints responsible for stress placement by comparing how well they support the learning of attested systems and whether they can help explain typological gaps. Pilot work also found indications that a neural network could learn generalizable representations of the data; the project is further testing this method. All of the software developed in this project is being made freely available, as is a database of the stress systems of the world’s languages.
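To make the hidden-structure problem concrete, here is a minimal Python sketch that lists the foot parses compatible with one observed stress pattern. It is an illustration only, not the project's software: the function name `consistent_parses` and the toy foot inventory (unfooted syllables, one-syllable feet, and two-syllable trochees and iambs) are assumptions chosen for this example, and the linguistic theories under comparison may posit different representations.

```python
def consistent_parses(stress):
    """Yield every foot parse compatible with an overt stress pattern.

    `stress` is a tuple of 0/1 marks, one per syllable (1 = stressed).
    Hypothetical toy inventory (an assumption for illustration):
      "s"      unfooted, unstressed syllable
      "(S)"    one-syllable foot
      "(S s)"  trochee (head first)
      "(s S)"  iamb (head last)
    A capital S marks the foot head, which carries the stress.
    """
    if not stress:
        yield ()
        return
    first, rest = stress[0], stress[1:]
    if first == 0:
        # An unstressed syllable may be left unfooted ...
        for tail in consistent_parses(rest):
            yield ("s",) + tail
        # ... or be the weak member of an iamb, if a stress follows.
        if rest and rest[0] == 1:
            for tail in consistent_parses(rest[1:]):
                yield ("(s S)",) + tail
    else:
        # A stressed syllable may head a one-syllable foot ...
        for tail in consistent_parses(rest):
            yield ("(S)",) + tail
        # ... or head a trochee, if an unstressed syllable follows.
        if rest and rest[0] == 0:
            for tail in consistent_parses(rest[1:]):
                yield ("(S s)",) + tail


# A three-syllable word with stress on the middle syllable.
for parse in consistent_parses((0, 1, 0)):
    print(" ".join(parse))
```

Under these assumptions, the single overt pattern with medial stress has three parses, including both a trochaic one, `s (S s)`, and an iambic one, `(s S) s`. The learner sees only the stress marks, so it has to weigh such competing candidates against each other, which is the kind of ambiguity an Expectation Maximization approach is meant to handle.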