Old English Parser 4

The next step is to isolate syntactic clusters. These clusters comprise phrases, words in a series, and so forth. An example is a prepositional phrase (PP). A PP is made up of a preposition (PRP) and an object noun. It may also contain any number of modifiers (adjectives, determiners, pronouns) and conjunctions (CNJ). Take the sentence “I saw the house of oak and stone.” The PRP is “of,” the CNJ is “and,” and both “oak” and “stone” are dative object nouns, called objects of the preposition. We read from left to right in English, so we read “of” and then process the rest of the PP. A parser should do the same.

In Old English, we also read from left to right. Consequently, the position of each word in a sentence and a phrase is important. Although OE poetry allows for significant hyperbaton (scrambling of the usual word order), it is rare that a preposition would sit after its object(s). In fact, prepositions are relatively unusual in poetry, since a dative or accusative inflection already gives that information to the reader. Because position matters, the sentence will be held in an indexed array. The left-most word will be read in first and assigned the lowest number.
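A minimal sketch of the indexed array in Python, assuming the sentence arrives as a plain string and that splitting on whitespace is good enough for now. The function name index_sentence is my own placeholder, and numbering from 0 is a choice, not a requirement.

```python
def index_sentence(sentence):
    """Split a sentence on whitespace and number the words left to right.

    The left-most word gets the lowest index (0 here).
    """
    words = sentence.split()
    return list(enumerate(words))


indexed = index_sentence("Wiht unhælo grim ond grædig")
for position, word in indexed:
    print(position, word)
```

Keeping the index beside each word means a later step can refer to a phrase by its start and end positions rather than by copying the words.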

The first syntactic cluster I’ll attempt is the prepositional phrase. The reason to isolate this phrase first is that neither the main verb nor the subject nor the direct object of a sentence can be placed in a PP. By knocking the PP out of contention, we have a better chance at finding the subject, object, and verb.
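Here is one way the left-to-right scan for a PP might look in Python. It is a sketch under several assumptions: the words are already tagged, the tag names N, ADJ, ART, DET, PRN and V are placeholders of mine alongside PRP and CNJ, and the scan is greedy, so it will overreach whenever a conjunction joins two clauses rather than two objects.

```python
# Tags allowed inside a PP after the preposition (assumed tag set).
PP_INTERNAL = {"N", "ADJ", "ART", "DET", "PRN", "CNJ"}


def find_pps(tagged):
    """Return (start, end) index pairs, inclusive, for each PP found.

    tagged is a list of (word, tag) pairs in sentence order. A PP starts
    at a PRP and runs to the last noun in the following run of modifiers,
    nouns, and conjunctions.
    """
    spans = []
    i = 0
    while i < len(tagged):
        if tagged[i][1] != "PRP":
            i += 1
            continue
        j = i + 1
        last_noun = None
        while j < len(tagged) and tagged[j][1] in PP_INTERNAL:
            if tagged[j][1] == "N":
                last_noun = j
            j += 1
        if last_noun is None:
            i += 1  # a preposition with no object noun: skip it
        else:
            spans.append((i, last_noun))
            i = last_noun + 1
    return spans


def outside_pps(tagged, spans):
    """Return the words left over once every PP is knocked out."""
    inside = {k for start, end in spans for k in range(start, end + 1)}
    return [pair for k, pair in enumerate(tagged) if k not in inside]


sentence = [("I", "PRN"), ("saw", "V"), ("the", "ART"), ("house", "N"),
            ("of", "PRP"), ("oak", "N"), ("and", "CNJ"), ("stone", "N")]
pps = find_pps(sentence)
rest = outside_pps(sentence, pps)
```

What remains in rest is the pool in which to look for the subject, object, and verb.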

A big challenge here is how to present the results.

Here’s one option:

PP | of [PRP] oak [N{dat}] and [CNJ] stone [N{dat}]
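A line like that is easy to generate. The sketch below assumes each token is a (word, tag, case) triple, with case set to None where it does not apply; the function name format_pp and the triple layout are my own placeholders.

```python
def format_pp(tokens):
    """Render a PP as 'PP | word [TAG{case}] ...' on one line."""
    parts = []
    for word, tag, case in tokens:
        # Append the case in braces only when one is given.
        label = f"{tag}{{{case}}}" if case else tag
        parts.append(f"{word} [{label}]")
    return "PP | " + " ".join(parts)


line = format_pp([
    ("of", "PRP", None),
    ("oak", "N", "dat"),
    ("and", "CNJ", None),
    ("stone", "N", "dat"),
])
print(line)
```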

That presentation can then be turned into HTML. The sentence could be printed and the PP appear in, say, green type. And when a user hovers over a word, its grammatical information could be revealed. But that’s inelegant, since the user has to interact with the data in order to make it productive. (The user already knows the sentence, so why return it again?)
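For what it is worth, the hover version could be sketched like this in Python. It assumes the browser's built-in tooltip (the title attribute) is enough for the hover, and it uses an inline style for the green type; the function name pp_to_html is a placeholder.

```python
from html import escape


def pp_to_html(tokens):
    """Wrap a PP in a green span; each word carries its tag as a tooltip.

    tokens is a list of (word, tag) pairs in phrase order.
    """
    words = [
        f'<span title="{escape(tag)}">{escape(word)}</span>'
        for word, tag in tokens
    ]
    return '<span style="color: green">' + " ".join(words) + "</span>"


markup = pp_to_html([("of", "PRP"), ("oak", "N{dat}")])
print(markup)
```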

Another option is to present the data in a table (which I can’t do easily in this blog).

PP | PRP | Nd | CNJ | Nd

…..| of | oak | and | stone

The cells of the table can then be colored green for a PP. In this case, the important data is returned first (the parsed results), and the sentence is returned as a reminder to the user.

Because the parser is intended to return optimal and suboptimal results, the table will have to have room for all of them. At the moment, I’m thinking of returning each complete parse as a table row, beginning with the optimal result, then proceeding downwards to the least optimal result. Here’s an example.
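To make the row ordering concrete, here is a small Python sketch. It assumes each candidate parse comes with a numeric score, higher meaning more optimal. How that score would be computed is not settled, so the numbers below are invented purely for illustration, as is the "Na" label for an accusative noun.

```python
def rank_parses(candidates):
    """Order (score, tags) pairs from optimal to least optimal."""
    return sorted(candidates, key=lambda c: c[0], reverse=True)


def table_rows(words, candidates):
    """One row of tags per parse, best first, then the words as a reminder."""
    rows = [" | ".join(tags) for _, tags in rank_parses(candidates)]
    rows.append(" | ".join(words))
    return rows


# Hypothetical scores: "on" can govern a dative or an accusative.
candidates = [
    (0.4, ["PRP", "Na"]),
    (0.9, ["PRP", "Nd"]),
]
rows = table_rows(["on", "ræste"], candidates)
for row in rows:
    print(row)
```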

Wiht unhælo grim ond [CNJ] grædig gearo sona wæs reoc ond [CNJ] reþe ond [CNJ] on [PRP] ræste genam þritig þegna þanon eft [ADV] gewat huðe hremig to [PRP] ham faran mid [PRP] þære [ART] wælfylle wica neosan

A sentence from Beowulf with the conjunctions (CNJ), prepositions (PRP), some adverbs (ADV), and articles/determiners (ART/DET) marked. You’ll notice that sona is an adverb, as is þanon. The old parser missed those.
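Marking these closed-class words can be done with a plain lookup table. The Python sketch below assumes a hand-built dictionary containing only the words named above; a real list would be much longer, and a lookup like this cannot tell when a word such as on is doing duty as an adverb rather than a preposition.

```python
# Tiny hand-built lexicon, limited to the words marked in the example.
CLOSED_CLASS = {
    "ond": "CNJ",
    "on": "PRP",
    "to": "PRP",
    "mid": "PRP",
    "eft": "ADV",
    "sona": "ADV",
    "þanon": "ADV",
    "þære": "ART",
}


def tag_closed_class(sentence):
    """Pair each word with its closed-class tag, or None if unknown."""
    return [(w, CLOSED_CLASS.get(w.lower())) for w in sentence.split()]


tagged = tag_closed_class("reoc ond reþe ond on ræste genam")
for word, tag in tagged:
    print(word, tag or "?")
```

Words left as None are the open-class items (nouns, verbs, adjectives) that need the inflection-based analysis.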

Meanwhile, I have only just discovered an NLP parser from the CS Dept at UMass. It is called Factorie. It appears to use the NLP server familiar to users of Python. Although it operates on North American English, it can be trained to parse other languages. That training may be more time-intensive than writing my own parser.