Old English Parser 5

Update 5/2017. Now that the static-state parser is working, it’s time to move on to the syntax. The next big issue is language. So far, I’ve written in Perl. Love it. Works brilliantly. But much of the available open-source code is in Python. NLTK is in Python. You can tokenize with it. Parse with it. The Helsinki parsed OE corpus loads into a script easy-peasy. Very tempting to switch horses midstream.

Second issue: I’m moving the parser from UNIX to Linux.

I’m putting the parser on a sandboxed Dell that’s about ten years old and about $150 on Amazon. A fine computer that still runs Ubuntu! As an aside, I can configure iptables to reject traffic from China, which cuts back significantly on processor time lost to robot hack attempts every two seconds. Here hacker “tomcat” is calling from 106.39.44.0, which is Chinanet, No. 31, Jingrong Street, Beijing.

Two seconds later, the next caller, 116.31.116.7 at port 36116:11, is from the same server. Between this sort of nonsense and the spam, it’s a wonder academic computing can do anything else.

Now that the basic parser is working, it needs fine-tuning. First, the lists of closed-class words need to be improved. Pronouns are especially difficult to differentiate, and I wonder if I need to. Relative, interrogative, demonstrative—are these required to assemble phrases?

Second, the list of prepositions (PRP) is too short. Mitchell & Robinson have a good list in their textbook.

Third, the order of assembly for the syntax portion needs some thought. Obviously, closed class words must come first. After pronouns, prepositional phrases might be easiest to isolate. The trick is to try to think like an undergrad: what were the first steps? Small closed-class words (PRN, CNJ, PRP), then obvious inflections (-um, -ode, -ost, etc).

Update 6/20/2017. It’s working. The parser finds possible inflections and declensions based on suffixes, then checks for closed-class words, then checks for prepositional phrases. Still a long way to go. Noun phrases, the right boundary of the PP, suboptimal parses, and so on. But it’s working. Here it is: http://www.bede.net/misc/dublin/parse.html
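The order of operations here (suffix guesses, then closed-class lookup, then phrase checks) can be sketched in a few lines. Since the post weighs a move to Python, the sketch is in Python; the word lists, tags, and function names are invented stand-ins, not the parser’s actual data.

```python
# A toy version of the parse order: gather suffix-based guesses,
# then check the closed-class lists. Word lists are tiny stand-ins.
CLOSED = {"ond": "CNJ", "on": "PRP", "mid": "PRP", "se": "DET"}
SUFFIXES = {"um": "N{dat pl}", "ode": "V{past}", "ost": "ADJ{sup}"}

def tag_word(word):
    """Return every candidate tag for a word, keeping all possibilities."""
    tags = [tag for suf, tag in SUFFIXES.items() if word.endswith(suf)]
    if word in CLOSED:
        tags.append(CLOSED[word])
    return tags or ["UNK"]

def tag_sentence(sentence):
    return [(w, tag_word(w)) for w in sentence.lower().split()]

tagged = tag_sentence("he lufode mid heofonum")
# 'lufode' matches -ode, 'mid' is closed-class, 'heofonum' matches -um
```

A real pass would of course consult the full closed-class flat-files and inflection lists rather than these toy dictionaries.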

Old English Parser 4

The next step is to isolate syntactic clusters. These clusters comprise phrases, words in a series, and so forth. An example is a prepositional phrase (PP). A PP is made up of a preposition (PRP) and an object noun. It may also contain any number of modifiers (adjectives, determiners, pronouns) and conjunctions (CNJ). So, “I saw the house of oak and stone.” The PRP is “of,” the CNJ is “and,” and both “oak” and “stone” are dative object nouns—called objects of the preposition. We read from left to right in English. So we read “of” then process the rest of the PP. A parser should do the same.

In Old English, we also read from left to right. Consequently, the position of each word in a sentence and a phrase is important. Although OE poetry allows for significant hyperbaton (messing up the syntax), it is rare for a preposition to sit after its object(s). In fact, prepositions are relatively unusual in poetry, since a dative or accusative inflection already gives that information to the reader. Consequently, the sentence will be held in an indexed array. The left-most word will be read in first and assigned the lowest number.

The first syntactic cluster I’ll attempt is the prepositional phrase. The reason to isolate this phrase first is that neither the main verb, the subject, nor the direct object of a sentence can be placed in a PP. By knocking the PP out of contention, we have a better chance at finding the subject, object, and verb.
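As a rough illustration of knocking out the PP, here is a hypothetical scan over the indexed sentence: find a preposition, then absorb following modifiers, nouns, and conjunctions until a word that cannot belong to the phrase marks the right boundary. The tag table is hand-assigned for the English example above.

```python
# Find a PRP, then absorb following determiners, adjectives, nouns,
# and conjunctions; a word outside that set (here the verb) marks
# the right boundary. Tags are hand-assigned for this one example.
TAGS = {"I": "PRN", "saw": "V", "the": "DET", "house": "N",
        "of": "PRP", "oak": "N", "and": "CNJ", "stone": "N"}
IN_PP = {"DET", "ADJ", "N", "CNJ"}

def find_pp(words):
    """Return (start, end) index spans of candidate PPs."""
    spans, i = [], 0
    while i < len(words):
        if TAGS.get(words[i]) == "PRP":
            j = i + 1
            while j < len(words) and TAGS.get(words[j]) in IN_PP:
                j += 1
            spans.append((i, j - 1))
            i = j
        else:
            i += 1
    return spans

words = "I saw the house of oak and stone".split()
spans = find_pp(words)  # [(4, 7)]: 'of oak and stone'
```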

A big challenge here is how to present the results.

Here’s one option:

PP | of [PRP] oak [N{dat}] and [CNJ] stone [N{dat}]

That presentation can then be turned into html. The sentence could be printed and the PP appear in, say, green type. And when a user hovers over a word, its grammatical information could be revealed. But that’s inelegant, since the user has to interact with the data in order to make it productive. (The user already knows the sentence, so why return it again?)

Another option is to present the data in a table (which I can’t do easily in this blog).

PP | PRP | Nd | CNJ | Nd
…..| of | oak | and | stone

The cells of the table can then be colored green for a PP. In this case, the important data is returned first (the parsed results), and the sentence is returned as a reminder to the user.

Because the parser is intended to return optimal and suboptimal results, the table will have to have room for all of them. At the moment, I’m thinking of returning each complete parse as a table row, beginning with the optimal result, then proceeding downwards to the least optimal result. Here’s an example.
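That row-per-parse layout is easy to sketch: pair each complete parse with a rank score (scores invented here) and emit rows best first, with the optimal parse in the top row.

```python
# Each complete parse becomes one table row, ordered from optimal
# down to least optimal. The parses and scores are invented.
parses = [
    ("of [PRP] oak [N dat] and [CNJ] stone [N dat]", 0.91),
    ("of [PRP] oak [N acc] and [CNJ] stone [N acc]", 0.44),
]
ranked = sorted(parses, key=lambda p: p[1], reverse=True)
rows = ["<tr class='pp'><td>%s</td></tr>" % text for text, _ in ranked]
# rows[0] holds the optimal parse; the original sentence can then be
# printed after the table as a reminder to the user
```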

Wiht unhælo, grim ond [CNJ] grædig, gearo sona wæs, reoc ond [CNJ] reþe, ond [CNJ] on [PRP] ræste genam þritig þegna; þanon eft [ADV] gewat huðe hremig, to [PRP] ham faran, mid [PRP] þære [ART] wælfylle wica neosan.

A sentence from Beowulf with the conjunctions (CNJ), prepositions (PRP), some adverbs (ADV), and articles/determiners (ART/DET) marked. You’ll notice that sona is an adverb, as is þanon. The old parser missed those.

Meanwhile, I have only just discovered an NLP toolkit from the CS department at UMass called Factorie. It appears to use the NLP server familiar to users of Python. Although it operates on North American English, it can be trained to parse other languages. That training may be more time-intensive than writing my own parser.

Old English Parser 3

Once the parser delivers all possibilities to the calling function, it’s time to decide what’s what. What is a verb, what is a noun, and so on. Because language is not mathematics, there is rarely a single answer. Instead, there are better and worse answers. In short, all possible answers need to be ranked. The best answer(s) are called “optimal.”

As far as I know, current iterations of the parsed corpora of OE treat all OE utterances as equivalent. In other words, the syntax of a sentence from the poem Beowulf is as grammatical as a sentence from Aelfric’s sermons. Bruce Mitchell, the greatest of OE syntacticians, compiled his conclusions from an undifferentiated variety of sources: “I have tried to select a variety of relevant examples or to illustrate the phenomenon from Aelfric and Beowulf, as best suited my purpose” (OE Syntax, p. lxiii). Although he noticed distinctions in syntax between authors, he attempted to derive from them a single form. For example, swelc ‘such’ is a strong, demonstrative pronoun in most OE writing; but it is used in a weak form “in ‘Alfredian’ prose” (sec. 504). Mitchell chased an abstract description of syntactic phenomena, whose instantiated form in language he illustrated with examples. Variants were classed as deviations from a norm. His was a Platonic quest. Mine, much impoverished by comparison, shall be a touch more Aristotelian.

Syntax will be subdivided generically. What is optimal in a prose chronicle may not be optimal in a poem.

Syntax will be subdivided temporally. What was optimal in Alfred’s court may not have been optimal in Cnut’s.

And syntax will be subdivided spatially, or rather, geographically. What was optimal in Anglian may not have been optimal in Kentish.

These subdivisions will have to be rough-and-ready, since we lack diplomatic editions of manuscripts. Editors have already elided much of the information that distinguishes one scribe’s work from another’s. For example, I noted in a manuscript of Aelfric an instance of OE faðer ‘father’, which was not supposed to exist at the time. As diplomatic editions become available, I will be able to account for them. For the moment, any dated or datable text will be marked as such. The “syntaxer” (the program that parses syntax) will prefer texts of similar genre, locale, and time.
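A first cut at that preference might be a crude closeness score over genre, period, and dialect. Everything here (the metadata, the one-point-per-match scoring, the names) is invented for illustration.

```python
# Prefer evidence from texts close in genre, period, and dialect.
# The metadata and the one-point-per-match scoring are invented.
def closeness(source, target):
    return sum(source[k] == target[k] for k in ("genre", "period", "dialect"))

target = {"genre": "prose", "period": "late", "dialect": "west-saxon"}
sources = [
    {"name": "Aelfric", "genre": "prose", "period": "late", "dialect": "west-saxon"},
    {"name": "Beowulf", "genre": "poetry", "period": "early", "dialect": "anglian"},
]
ranked = sorted(sources, key=lambda s: closeness(s, target), reverse=True)
# Aelfric outranks Beowulf as evidence for late West Saxon prose
```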

INDICES. The major component to the project is a set of massive indices.

Like a Google search, this parser operates on tables of frequencies. Google digests the raw web daily, at least. The web is then sifted through algorithms at Google’s massive data centers. That sifting process results in massive indices of frequencies. An index will record searches, links, and clicks. A search for Silly Putty is recorded. A second search. Then a third. Three users click on a single link. That link is recorded and marked as the top link. The next user to search for Silly Putty is sent the top link first.

An Aelfric index. By parsing the works of Aelfric, the computer can build a list of most-used words and their usual grammatical functions in Aelfric’s sentences. So, someone searching for heald in Aelfric will prompt the return of pre-parsed sentences, rather than invoking a new search. The sentences will come in order: the most usual use of heald first, the least usual last. (It seems most often to be in participial form.)
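The index itself can be as plain as a frequency table over pre-parsed uses. A toy version, with invented tallies for heald:

```python
# Count each pre-parsed use of 'heald' by grammatical function and
# serve the most usual use first. The tallies are invented.
from collections import Counter

uses = ["PTCP", "PTCP", "PTCP", "N NOM S", "V IMP"]
index = Counter(uses)
ranked = [tag for tag, _ in index.most_common()]
# ranked[0] is the most usual use; the pre-parsed sentences for it
# would be returned first, the least usual last
```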

Other indices are obvious. A Wulfstan index. The works of King Alfred’s court. Benedictine books. The Chronicle. Poetry. And so forth.

PROGRESS:

Meanwhile, the parser is taking a sentence and parsing it!

The first test sentence was, of course, “Hello, World!” Here is the screen grab of the first Old English sentence to work:

screen_sentence

Se man wæs god. The correct forms are listed in each cluster.

Anglo-Saxon lyre 3

Last day on the lyre. Zither pins go in when I decide on four strings or six. Just a few odds and ends left. (Latin camp was excellent. We went from Proto-Indo-European to Old Latin, looked at theme vowels in the various inflections and declensions, and tried to make sense of the various phonological categories of stem vowels. By the end of the second day, we were reading Bede’s account of Caedmon!)

First, I painted the dragon heads with a terra-cotta base in preparation for gilding.

pic_lyre_twoHeads1

My penurious Scots soul wouldn’t allow me to spend heavily on real gold, so I bought a cheap-o set of gilding materials from the Mona Lisa company. The glue was terrible, and the gilding is a composite, so it doesn’t act like real leaf. (I may get real leaf later.) Nevertheless, it turned out alright.

pic_lyre_gildedHead

Next came the banding. I cut down strips of basswood over which I laid walnut-and-beech banding. I mitered it in a hobby miter box with a fine-toothed Japanese pull saw designed for dovetails. Glued with hide glue—very important, since it dries slowly and the miters needed readjusting quite a bit.

pic_lyre_banding

Then I fit the cross-piece into the heads with epoxy. Mighty strong glue. It’s hard to see from the picture, but the face of the cross-piece is dead center along the x-axis of the lyre. The force of the strings will pull down through the center of the heads, through the center of the posts, and onto the footrests I carved in the frame.

pic_lyre_twoHeads2

The strings attach to a peg. So it was time to make the peg. The guitar strings I’m using are attached by knots that are similar to a noose. So it seemed perfectly fitting to carve the Hanged God, Odin. Although the second picture is out of focus, it shows what a little linseed oil does to beautify the wood. One addition: I wrapped a copper wire twice around Odin’s neck and secured it. The strings then go under the wire, knotted at one end. It keeps them in place with room for all.

pic_lyre_wodenPeg

pic_lyre_wodenPeg2

And the (almost-) finished lyre:

lyre_finished1

Here it is oiled and waxed, with mother-of-pearl inset into the supports, with the strings on. Tuned to the tonic of D with bass strings at E and A. It works with a glass slide, too.

Full Lyre

Sound

Summer of 2016 will see a second lyre. This time, the back and front will be made from canary wood.

Anglo-Saxon Lyre 2

Stanchions glued in. Used yellow glue rather than hide glue since they are structural. They stood proud of the side. Foolishly, I used a #4 plane, which is as big as the lyre box, to bring down the posts. Only half-way through my first coffee, so naturally I slipped and took a chunk out of the base. Squared the damage with a chisel and inset a piece of rosewood. If I ever need a pick-up, this is where it will go. Lesson learned: small tools for small jobs.

pic_lyre_busted

So, brought down the posts with sandpaper. Checked for level and square to the sides. Finally, trued the upper ridge.

pic_lyre_level

With the posts set in place, I glued on the top with yellow glue. Sanded the sides flush to the top. Then sanded the entire box for a couple of hours, running eight steps from 80-grit up to 600-grit. Gaps are visible around the posts, so I’ll fill them, then add banding to cover the flaws.

pic_lyre_postGap

Here’s the bottom with the patch:

pic_lyre_baseFinished

And the finished box:

pic_lyre_boxFinished

Everything goes on hold now for a two-day intensive Latin Camp. We’re going to learn PIE to Latin. After all, there just aren’t enough people who can identify an Oscan epenthetic vowel in an Old Latin borrowing.

Anglo-Saxon Lyre

Taking a short break from the natural-language parser to make a modified Sutton Hoo lyre (based partly on an Instructables guide, possibly from Rutgers). I scoured the net for ideas, but was most impressed with Michael King’s lyre. Virtually every lyre out there is a rectangle, a squared doughnut. Having played a beautiful lyre made by my friend Jul, an incredible metalsmith and artist, I thought it nevertheless slightly awkward to hold. In this version, I reduced the size of the lyre and changed its configuration. Rather than a rectangle, I decided on a sound box attached by two long stanchions to a head-piece. (I was thinking of a double-necked guitar with a bridge between the two heads.)

05

03

This is the first idea for a layout. The sound box is basswood (hard, but easier to carve than maple, the wood used at Sutton Hoo). The two upright stanchions are white oak. The cross-piece is white oak. And the dragon heads are basswood, inspired by the Oseberg ship. I carved them with a Morakniv, simply the best carving knife I have ever used. The next step was to rout out the sound box. I set my router’s depth to leave 1/8th of an inch for the bottom, planning later to carve down to 1/16th. I left two posts to hold the bases of the stanchions.

09

10

The knob on the inside base is for installing a nut around which the strings will be gathered. Here’s one of the stanchions fit into place:

15

After routing, I used a gouge to bring the bottom to level. Two considerations: first, the pressure on the box is down its central axis. So, rather than put in a truss rod or brace, I left the central axis 1/8th inch thick. Second, the sound has to vibrate along the bottom, so the two sides of the central axis were carved down to 1/16th.

01

02

The result left two valleys on either side of the central axis. I splayed out the base of each valley, and the result was the shape of a tree. Yggdrasil, probably. Word on the web is that when thinning panels for a sound box, what matters is not thickness so much as density. So the old way to check was to hold the bottom up to a light source and look for the “fire.” A violin maker told me that this stage is called “candlelighting.” Here’s the base held up to a light. The fire-red bits are 1/16th thick.

candelight

At this point, I decided that the dragon heads would hold the cross-piece rather than mount it. Here they are carved and sanded, then holding the cross-piece:

16

04

And here’s the new layout. Note the tree-shaped interior and the rather suggestive curve of the base, which I hope will give great bottom to the sound. Seriously, I wanted the sound to bounce around in there, echoing and re-echoing.

06

The next stage was the sound hole. It’s an option, but not necessary. I decided on a hole one-third of the width of the lyre, based on a guitar by Juan Cayuela, a brilliant luthier from whose descendant I bought my classical guitar. The top is a piece of rosewood, 4″ wide and 1/8th of an inch thick. I glued two pieces together to make a single 8″ top. Then I routed 1/16th from the center of the board. The result was like a dinner plate, with a border of 1/8th and a valley of 1/16th. The bridge is a ukulele bridge from Stewart-MacDonald. They also have excellent supplies of mother-of-pearl.

08

07

The sound hole looked a little bare, so I took an idea from a renaissance lute and carved an inset. Using a design from the Book of Kells, I started by tracing the sound hole on a piece of basswood. I ripped the basswood down the middle, leaving two 1/16th slabs.

11

After layout, I carved the figure, then carved down to the circle, leaving a raised disc. It fit in very nicely. I glued the inset onto the back with hide glue:

12

And here it is from the front:

13

14

Still waiting on the tuning machines and the gut strings. I’ve also got mother-of-pearl to set in. The dragon heads will be gilded and will have garnet set into their eyes. More to come.

Old English Parser

NLP. Thanks to a grant from the CHFA at UMass, I am writing a natural-language parser for Old English (OE). Most parsers read tags placed in a text by linguists. This parser attempts to read OE as a student might. This post and the ones following are for any noodlers like me who are thinking of similar projects.

The early version of this parser lives here. It is a simple cgi-perl script that opens a series of flat-files, checking for closed-class words. That’s fine when the word is unc ‘our’, for example. The word is uninflected and unique. But what do you do with ac ‘oak, but’, which is both a noun and a conjunction?

This new version incorporates Optimality Theory.

Its design is based on the architecture of UNIX. Where UNIX runs daemons that listen for signals, this parser runs multiple scripts. The benefit of this multi-piece architecture is that I can run several scripts (pseudo-)simultaneously. So while one part of the main script is testing conditions, another part can call functions. It speeds things up.

I’ve decided to use Perl. I played with Python and Ruby for a while, but they don’t offer significant improvements on Perl (yet). Some excellent advice from Scott Kaplan at Amherst College: write it all in a language you know, then check for bottlenecks. Modify appropriately. Besides that good advice, Perl offers excellent modules. One of them is Memoize. It keeps the results of a designated subroutine in a hash, keyed by the subroutine’s arguments. The second time you call the subroutine with the same argument, it returns the stored result instead of recomputing it. Speeds things up mightily. For a word like þa, the savings are tremendous.
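For readers following along in Python rather than Perl, functools.lru_cache gives roughly the effect of Memoize: results are cached by argument, so repeated lookups of a common word like þa skip the expensive work. The lookup table below is a stand-in for the real grammar routines.

```python
# functools.lru_cache plays the role of Perl's Memoize: results are
# stored by argument, so the second call with the same word returns
# the cached value. The lookup table is a stand-in.
from functools import lru_cache

calls = []

@lru_cache(maxsize=None)
def grammar_lookup(word):
    calls.append(word)  # the expensive work runs only on a cache miss
    return {"þa": ("ADV", "DET", "PRN")}.get(word, ("UNK",))

grammar_lookup("þa")
grammar_lookup("þa")  # served from the cache; calls still has one entry
```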

For the moment, I’m running it all on a Macintosh. It has a nice UNIX kernel, easily accessed and easily updated with MacPorts. I can install any of thousands of free programs to complement the UNIX suite. And with Perl, CPAN offers free modules that ease coding tremendously.


Wernicke


 LEVEL ONE. The first step is to stratify the task. I work from the bottom up. At the lowest level sits a reproduction of Wernicke’s Area. It consists of a number of text files. These are meant to reproduce a speaker’s knowledge of forms. Closed-class words are listed in various text files. Other files include lists of inflections and conjugations. A number overlap. That’s fine. At this level, the aim is to return as many possible results as can be had.

Each file is a list of words separated by “\n”. When opened, it can be read into an array by the calling script. The “\n” automatically divides the list into array elements. Some examples are prn.txt (pronouns), cnj.txt (conjunctions), and num.txt (numerals).
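The same read-and-split step looks like this in Python (the Perl version splits on "\n" identically); cnj.txt is written to a temporary directory here just to keep the sketch self-contained.

```python
# cnj.txt is one word per line; reading the file and splitting on
# "\n" yields the array of conjunctions. The temporary path exists
# only to keep the example self-contained.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cnj.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("and\nac\nond\n")

with open(path, encoding="utf-8") as f:
    conjunctions = [w for w in f.read().split("\n") if w]
# the final filter drops the empty element left by the trailing newline
```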

One complication is OE spelling. Thorn (þ) and eth (ð) are often interchangeable. In the Oxford Corpus of Old English, they are designated &t; and &d; respectively. So pattern-matching requires /( &t; | &d; )/. This variation is quite usual. Others, not so much. West Saxon breaking, which sometimes gives ea for e and so on, cannot be hard-coded into the initial process. More unusual spelling variants are left for a later stage.
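In Python terms, the interchangeability of thorn and eth can be handled with a character class, the regex analogue of the /( &t; | &d; )/ alternation:

```python
# Thorn and eth are interchangeable, so a character class [þð]
# stands in wherever either letter appears, much like the corpus
# alternation /( &t; | &d; )/.
import re

def th_pattern(word):
    """Compile a pattern in which thorn and eth match each other."""
    return re.compile(re.sub("[þð]", "[þð]", word))

pat = th_pattern("þæt")
assert pat.match("ðæt")       # the eth spelling variant still matches
assert not pat.match("bæt")   # other letters do not
```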

These lists are reproduced elsewhere, especially in the second stage, where particular results are grammaticalized. So, ac will return CNJ as well as information about possible noun inflections. Here’s a screen capture:
screen_ac

The returned values are read into an array in the calling script. The first value is CNJ UNDCL, which means “conjunction, undeclinable.” The second value is N S M NOM S, which means “noun, strong, masculine, nominative, singular.” At a later stage, a script will take into account all the words in a given sentence and calculate the optimal syntactic arrangement by using these values. It will also present suboptimal possibilities in a ranked list.

Similar lists are returned for open-class words, although they are substantially larger. Here is a screen shot for the word godan, a weak adjective meaning ‘good’:

screen_godan


Broca


LEVEL TWO. At the second and third levels sits a reproduction of Broca’s Area. It consists of a series of scripts that first receive data from level one, then parse that data and call new scripts accordingly.

For example, given the word godan, a script called grammar.pl clears the word of any unwanted characters and chomps off any newlines. Then it sends the word via a pipe to closed_class.pl. Closed_class.pl opens each of the text files in turn, asking whether or not the word is listed there. If it is, then closed_class.pl prints the results to a buffer. If not, then it returns nothing. The result is very simple: the part of speech. No inflections yet. That’s next.

If closed_class.pl returns an answer (such as PRN, or pronoun), then the next step is to find out which kind of pronoun. A call goes out to gram_prn_det.pl to see if the PRN is a determiner and if so, how it is inflected. Then, off to gram_prn_pers.pl, to see if it’s a personal pronoun, and if so, which one. Then it looks at demonstrative pronouns. The word is tested in a simple if statement that looks like this:

if ($word =~ /^ic\b/i)    { print "PRN PERS 1 NOM S\n"; } # 1st p s
if ($word =~ /^me\b/i)    { print "PRN PERS 1 ACC S\n"; }
if ($word =~ /^min\b/i)   { print "PRN PERS 1 GEN S\n"; }
if ($word =~ /^minne\b/i) { print "PRN PERS 1 ACC S\n"; } # minne is accusative, not genitive
if ($word =~ /^me\b/i)    { print "PRN PERS 1 DAT S\n"; }

Because I use a for-loop without a counter, the value $_ is not assumed; thus the variable $word. To ensure that the pattern matches a whole word only, I use ^ (match at beginning of string) and \b (word break). That way, I won’t accidentally match wic when I’m looking for ic. Because pronouns are so important for understanding the syntax of an OE sentence, they will receive a very high value when they are processed by the syntax script.
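The same anchoring works in Python regexes, where ^ and \b behave as they do in Perl:

```python
# ^ pins the match to the start of the string and \b requires a word
# boundary after the target, so a search for 'ic' will not hit 'wic'
# (wrong start) or 'icge' (no boundary after 'ic').
import re

def is_exact(word, target):
    """True only when word begins with target as a whole word."""
    return re.match(r"^%s\b" % re.escape(target), word) is not None

assert is_exact("ic", "ic")
assert not is_exact("wic", "ic")    # 'ic' buried inside 'wic'
assert not is_exact("icge", "ic")   # letters continue past 'ic'
```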

Once that procedure is over, all closed-class words have been dealt with.

Next, grammar.pl checks open-class words. As of 6/15/2015, I have listed nominal and adjectival inflections. If a word ends in a nominal or adjectival inflection, that result is added to the buffer. All possibilities are entertained at this point. Godan ends in -an, which could also indicate an infinitive. The third level will deal with that option as it sifts through all the words in a sentence at once.
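A sketch of that keep-everything behavior for -an, with invented tag names; the point is only that no reading is discarded at this level:

```python
# Every reading of the ambiguous -an ending is kept in the buffer;
# the sentence-level pass at the third level chooses among them.
# Tag names are invented.
SUFFIX_READINGS = {"an": ["ADJ weak", "N oblique", "V infinitive"]}

def readings(word):
    out = []
    for suf, tags in SUFFIX_READINGS.items():
        if word.endswith(suf):
            out.extend(tags)
    return out

options = readings("godan")  # all three possibilities are entertained
```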

Germanization of English

It seems that one of the background processes in American English is an increase in adnominal adjectives. You don’t make a choice about a college, but a college choice. The prepositional phrase is turned into a pre-position adjective, turning a noun into an adjective, rendering a compound noun worthy of German. I’ve noticed hundreds. Academics don’t have meetings of the faculty, they have faculty meetings; they no longer discuss the curriculum, they have curriculum discussions. In a recent memo, someone was described as a community heritage preserver, which is an astounding way of saying that she preserves the heritage of her community.

So what? Well, a tea cup is a thing to drink from, and a cup of tea is an amount of tea; I’m making a cup of tea is not the same as I’m making a tea cup. Likewise, a meeting of the faculty is a meeting, plain and simple, made up of faculty. A faculty meeting is a kind of meeting. The first form describes a genus (meeting) populated by a species (faculty); the second introduces a new genus. English speakers do not have kinds of meetings (people meetings, carnival meetings, everyone-in-red meetings, cornhusker-fan meetings, and so forth). English speakers have meetings, plain and simple. Although English speakers can figure out what is meant by both forms, the compounded form adds unnecessary and sometimes misleading ambiguity.

Update [7/19/2015]: The metathesis of adnominal genitives (e.g., leaf of laurel) into denominal adjectives (laurel leaves, Lat. folia laurea) appears to have been popular during the late Republic and early Empire in Rome. Cicero shied from the pattern, preferring the genitives.

So, let’s say that someone studies metaphors of feces in the works of Chaucer, as in “fecopoetics”–read up on it for $105.00. Such a study would actually be a study of images of feces. (Chaucer left us no actual feces.) The order of nouns runs from general-and-inclusive to specific-and-exclusive: a study, which is of images, and more specifically of images of feces. The last two nouns and their accompanying words get compounded to feces images, or rather, fecal images, which wrongly introduces a new genus of images. English speakers do not have categories of images (images of birds, images of Joe, images of animals recently returned from drinking, and so forth). English speakers have images, plain and simple. The compounding form fecal is then transferred backwards and upwards to make a new genus of study: fecal study. Or, more likely, pluralized to fecal studies.

Thus arises the proliferation of areas of study. We have fecal studies, gender studies, postcolonial studies, Marxist studies, and so forth. The implication is that each of these compounds represents a new field, which is incorrect. All belong to the same academic field: the study of culture, called “cultural studies.” And all employ the same general methods of the field of cultural studies. What differs is the object of study. One decent and clever fellow, Asa Mittman, wants to start the discipline of monster studies since he studies images of monsters. I honestly don’t see a university opening up a hiring line in monster studies. But I do see universities regularly hiring in cultural studies. So why hide the obvious strength of a mutually-supportive, communal enterprise in divisive compounds—or is that “compounds of divisiveness”?

UPDATE (1/29/2019): A terrific example just arrived by email, a description of a presentation. It reads:

Most methods for relation extraction from text rely on pre-trained entity resolution models in order to find the entities mentioned in text. Text-enhanced knowledge graph (KG) completion methods also rely on such an entity resolution model as a pre-processing step. We present a method to simultaneously learn entity resolution as well as relation extraction or KG completion without relying on a pre-trained entity resolution model or mention-level entity resolution data for training.

Wow! We can see a series of genitive + Noun-ion constructions. So, to extract relations gets changed by turning the object (relations) into a plain adjective and putting it before a noun ending in -ion (extraction). Similarly, a model that has been pre-trained to resolve entities becomes a “pre-trained entity resolution model.” Try it at home. Consider a computer that has been programmed to calculate averages: would that be a programmed average-calculating computer? What a mess! But it’s the new norm. If you want to sound scientific, you take all of your descriptive or appositive phrases and cram them into adjectival positions. That would be positionally crammed adjectival phrase scientism.