Old English Parser 2

SIMPLIFICATION. It is tempting to create lists that would quickly distinguish forms. For example, OE eadig ‘blessed’ is obviously an uninflected adjective (–ig). It could be almost nothing else. But the task of discerning its word class belongs to the syntactic stratum, to syntax.pl. At the most basic level, eadig could be a noun with a consonantal stem. It could be a name (> Eddie). So, at the most basic level, an NLP parser has to be greedy. It has to take in as many options as possible. Only then can the syntactic stratum apply constraints.

A second example is swa. The word can be a conjunction ‘because, so that’ or an adverb ‘thus.’ So it has to be listed in both files: conjunctions and adverbs. The bottom stratum of the parser has to return both possibilities, ADV and CNJ, to the calling script.
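A minimal sketch of that greedy lookup, assuming a hypothetical candidates subroutine and a small rule table (the table, the sub name, and the bare tag strings are illustrative, not the parser's actual code):

```perl
use strict;
use warnings;

# Hypothetical closed-class rules: each pattern maps to one candidate tag.
# A word that belongs to two classes simply appears in two rules.
my @rules = (
    [ qr/^swa$/i, 'ADV' ],   # swa 'thus'
    [ qr/^swa$/i, 'CNJ' ],   # swa 'because, so that'
    [ qr/^ne$/i,  'NEG' ],   # negative particle
);

# Return every tag whose pattern matches; never stop at the first hit.
sub candidates {
    my ($word) = @_;
    return map { $_->[1] } grep { $word =~ $_->[0] } @rules;
}

print join(' ', candidates('swa')), "\n";   # prints "ADV CNJ"
```

The point of the design is that the lookup never chooses between readings; choosing is left to the syntactic layer.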

Some words are clearly one thing or another. Gea ‘yes’ is nothing else but ‘yes’. Ne is a negative particle. At the moment, I have those words listed in the grammar script that calls out for the lists. But I will eventually put them all in their own list (e.g., uninflected.txt). Thus, for now:

if ($word =~ /^ne\b/i){push @results, "NEG UNDCL\n";}
if ($word =~ /^gea\b/i){push @results, "INTJ UNDCL\n";}
if ($word =~ /^eala\b/i){push @results, "INTJ UNDCL\n";}

These lines say: if the word matches the pattern between the forward slashes, then add the following string to the buffer of results. Later, in the syntactic stratum, I will have to check whether there are two ne's (ne … ne ‘neither … nor’), since that will have to be parsed as a correlative conjunction. Similarly, swaþa, and so forth. Correlative conjunctions and compound adverbs will be listed in a rule set. That is the single most challenging portion of the parser: the rule set.
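The later check for a doubled ne might look something like the sketch below. It assumes the clause arrives as a plain list of tokens; the sub name has_correlative_ne and the output labels are hypothetical, and a real rule would need more context, since two ne's in one clause are only a candidate for the correlative reading.

```perl
use strict;
use warnings;

# Hypothetical syntactic-layer check: given the tokens of one clause,
# report whether ne occurs at least twice, in which case the pair is a
# candidate correlative conjunction (ne ... ne 'neither ... nor').
sub has_correlative_ne {
    my @tokens = @_;
    my $count = grep { /^ne$/i } @tokens;   # grep in scalar context counts matches
    return $count >= 2;
}

# Illustrative token list, not taken from a real text.
my @clause = qw(ne gold ne seolfor);
print has_correlative_ne(@clause) ? "correlative candidate\n" : "NEG only\n";
```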

At this point, I don’t want the parser to worry about what a word is; I want it to list what it could be.

Update: Perl is such a thoughtful language that I was able to simplify the architecture further. Initially, I had created word lists. Then, I checked whether a target word was on one of the lists. If it was on the conjunction list, I returned “cnj” to the calling function. But there was no need for the extra layer. I turned the lists into if-then statements, and ran them as Perl scripts from grammar.pl. No reason to repeat the information: just place it once inside a script. If a word is there, return the word’s class and grammatical information. No more text files. No more filehandles. No more closed class, open class questions. It’s reading all closed-class words, all nominal inflections, and all adjectival inflections.
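As a rough sketch of that arrangement: the post runs separate scripts from grammar.pl, but to keep this example in one runnable file the rule sets appear as subroutines. The names conjunctions, adverbs, and parse_word, and the tag strings "CNJ UNDCL" and "ADV UNDCL", are assumptions modeled on the lines quoted earlier.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the per-class scripts. Each one holds its
# words directly as if-statements, so there is no word list to read.
sub conjunctions {
    my ($word) = @_;
    my @results;
    if ($word =~ /^and$/i) { push @results, "CNJ UNDCL\n"; }
    if ($word =~ /^swa$/i) { push @results, "CNJ UNDCL\n"; }
    return @results;
}

sub adverbs {
    my ($word) = @_;
    my @results;
    if ($word =~ /^swa$/i) { push @results, "ADV UNDCL\n"; }
    return @results;
}

# The caller runs every rule set and concatenates whatever comes back.
sub parse_word {
    my ($word) = @_;
    return map { $_->($word) } (\&conjunctions, \&adverbs);
}

print parse_word('swa');
```

Each rule set both stores the word and states its analysis, which is the point of dropping the separate text files.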

I’ve got the whole parser down to 24.5 KB! Small as an icon! But because it executes 14 scripts, it’s already slowing down. About 1.8s to execute on a single word.
