Old English word shape

Crosswords are intriguing programming problems. How do you generate a New-York-Times-style crossword puzzle from a list of words? During an attempt (in Old English of course), I noticed some very interesting features of Old English words.

Consider the upper-left section of the crossword puzzle. One easy way to generate candidate words is to map each word to its phonological shape, then search a word list for instances of that shape. The simplest shape alternates consonants (C) and vowels (V). If 1-across is C-V-C-V, then the next word down, 13-across, is V-C-V-C. You then search the list of OE 4-letter words for that shape.
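The mapping can be sketched in python. The vowel inventory below is my assumption (æ and y treated as vowels, ð and þ as consonants):

```python
# Map a word to its consonant/vowel shape, then filter a word list by shape.
# Assumption: æ and y count as vowels; ð and þ count as consonants.
VOWELS = set('aæeiouy')

def shape(word):
    """Return e.g. 'CVCV' for 'bana'."""
    return ''.join('V' if ch in VOWELS else 'C' for ch in word.lower())

def words_of_shape(words, pattern):
    """All words in the list whose C/V shape matches the pattern."""
    return [w for w in words if shape(w) == pattern]
```

So `words_of_shape(four_letter_words, 'VCVC')` pulls out the candidates for 13-across.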

Say 1-across is C-V-C-V, bana ‘murderer’. 13-across might be V-C-V-C, eþel ‘homeland’. That sets up 1-down to start with be– and 2-down to start with aþ–. You’d think there would be plenty of words to fit that scheme.

But there are not! After extracting all words from the poetic corpus of Old English, I divided them into 3-, 4-, 5-, 6-, and 7-letter word lists. There are 133 tokens (distinct attested forms rather than lexemes) with the shape V-C-V-C:

abæd, abal, aban, aber, abit, acas, acol, acul, acyr, adam, adan, aðas, ades, adon, aðum, æcer, æðel, æfen, æfyn, ægen, æled, æleð, ænig, ænyg, æren, æres, æror, ærur, ætan, æten, ætes, æton, ætyw, æxum, afyr, agæf, agan, agar, agef, agen, agif, agof, agol, agon, agun, ahef, ahof, ahon, alæd, alæg, alæt, ales, alyf, alys, amæt, amen, amet, amor, anes, anum, arað, aras, ares, arim, aris, arod, arum, atol, aweg, awer, awoc, awox, axan, aþas, ecan, eces, ecum, eðan, eðel, edom, efen, enoc, enos, eror, etan, eteð, evan, eþel, ican, iceð, idel, ides, iren, isac, isen, isig, oðer, ofæt, ofen, ofer, ofet, ofir, ofor, onet, open, oreb, oroð, oruð, ower, oxan, oþer, ufan, ufon, ufor, upon, ures, urum, user, usic, utan, uten, uton, ycað, ycan, yced, yðum, yfel, ytum, ywan, ywaþ, ywed, yweð, yþum

And words with the shape C-V-C-V number 484:

bacu, baða, bæce, bæle, bære, bana, bare, baru, baþu, bega, bena, bene, bera, bere, bete, beþe, bide, bite, boca, boda, body, boga, bona, bote, bure, buta, bute, butu, byge, byme, byre, cafe, care, cele, cene, cepa, ciða, cile, come, cuðe, cuma, cume, cuþe, cyle, cyme, cymu, cyre, cyþe, dæda, dæde, dæge, dæle, dæne, ðæra, ðære, daga, dalu, ðane, ðara, dare, dege, dema, demæ, deme, dena, dene, ðere, ðine, dole, doma, dome, ðone, duna, dune, dura, dure, duru, dyde, ðyle, dyne, dyre, faca, fæce, fæge, fæla, fæle, fære, fana, fane, fara, fare, feða, feðe, fela, fele, fere, feþa, feþe, fife, fira, fire, five, fore, fota, fote, fula, fule, fuse, fyra, fyre, gara, gare, gatu, gedo, gena, geno, gere, geta, gife, gifu, gina, goda, gode, godu, gota, guðe, guma, gume, gute, guþe, gyfe, gyme, gyta, gyte, hada, hade, hæle, hælo, hælu, hæse, hæto, hafa, hafo, hafu, hale, hali, hama, hame, hara, hare, hata, hate, hefe, hege, hele, helo, here, hete, hewe, hige, hina, hine, hira, hire, hiwa, hiwe, hofe, hofu, hole, hopa, hope, horu, huðe, huga, huna, huru, husa, huse, huþa, huþe, hyde, hyðe, hyge, hyne, hyra, hyre, hyse, lace, laða, lade, laðe, læce, læde, læla, læne, lænu, lære, læte, lafe, lage, lago, lagu, lama, lame, lara, lare, lata, late, latu, laþe, lefe, lega, lege, lene, lete, lica, lice, lida, liða, lide, liðe, life, lige, lima, lime, liþe, liþu, locu, lofe, lufa, lufæ, lufe, lufu, lyfe, lyge, lyre, mæca, mæða, mæga, mæge, mæla, mæle, mæne, mæra, mære, mæro, mæru, mæte, maga, mage, mago, magu, mana, mane, mara, mare, meca, mece, meda, mede, meðe, medo, medu, melo, mere, mete, meþe, mide, mila, mine, moda, mode, modi, mona, more, mose, mote, muðe, muþa, muþe, myne, naca, næle, næni, nære, næte, nama, name, nane, neda, nede, nefa, nele, niða, niðe, nime, nine, niwe, niþa, niþe, noma, nose, noþe, nyde, nyle, race, racu, rade, raðe, ræda, ræde, raþe, rece, reða, reðe, rene, reþe, rica, rice, ricu, ride, rime, ripa, ripe, rode, rofe, rome, rope, rowe, rume, runa, rune, ryha, 
ryne, sace, sacu, sade, sæce, sæda, sæde, sæge, sæla, sæle, sæne, saga, sale, salo, salu, same, sara, saræ, sare, sari, sece, seðe, sefa, sege, sele, seme, sene, sete, sida, siða, side, siðe, sido, sige, sile, sina, sine, site, siþa, siþe, soða, soðe, some, sona, sone, soþa, soþe, sume, suna, suno, sunu, syle, syne, synu, sype, syre, syþe, tæle, tæso, tala, tale, tame, tane, tela, tene, tida, tiða, tide, tiðe, tila, tile, tima, tire, toða, tome, toþe, tuge, tyne, waca, wace, wada, wade, waðe, wado, wadu, waðu, wæda, wæde, wædo, wædu, wæge, wæle, wæra, wære, wæta, wage, wala, wale, walo, walu, wana, ware, waru, wega, wege, wela, wena, wene, wepe, wera, were, wese, wica, wida, wide, widu, wifa, wife, wiga, wige, wile, wina, wine, wire, wisa, wise, wita, wite, witu, woða, woma, wope, wora, woþa, wuda, wudu, wule, wuna, wyle, þæce, þæne, þæra, þære, þane, þara, þare, þine, þire, þone, þyle, þyre

Notice how many end with dative singular markers like –e. It suggests that we are looking at inflected forms of a C-V-C shape, one of the most common Proto-Indo-European root forms. Again, in the poetic corpus, there are 350 such tokens:

bad, bæc, bæd, bæð, bæg, bæl, bæm, bær, bam, ban, bat, bec, bed, beg, ben, bet, bid, bið, bil, bit, biþ, boc, boð, boh, bot, bur, byð, byþ, cam, can, cen, cer, ces, cið, col, com, con, cuð, cum, cuþ, cyð, cym, cyn, dæd, dæg, dæl, ðæm, ðær, ðæs, ðæt, ðah, ðam, ðan, ðar, ðas, day, ðec, deð, ðeð, ðeh, dem, ðem, ðer, ðes, ðet, deþ, dim, ðin, ðis, doð, dol, dom, don, ðon, dor, doþ, dun, ðus, dyn, ðyn, ðys, fæc, fær, fæt, fag, fah, fam, fan, far, fed, fel, fen, fet, fex, fif, fin, foh, fon, for, fot, ful, fus, fyl, fyr, fys, gad, gað, gæd, gæð, gæþ, gal, gan, gar, ged, gem, gen, get, gid, gif, gim, gin, git, god, guð, gyd, gyf, gyt, had, hæl, hær, hæs, hal, ham, har, hat, heg, heh, hel, her, het, hig, him, his, hit, hiw, hof, hoh, hol, hun, hus, hyd, hyð, hys, hyt, lac, lad, lað, læd, læf, læg, læn, lær, læs, læt, laf, lah, lar, laþ, lef, leg, len, let, lic, lid, lið, lif, lig, lim, lit, liþ, loc, lof, log, lot, lyt, mað, mæg, mæl, mæn, mæt, mæw, man, mec, men, mid, mið, min, mit, mod, mon, mor, mos, mot, muð, muþ, næs, nah, nam, nan, nap, nas, nat, neb, ned, neh, nes, nið, nim, nis, niþ, nom, non, num, nyd, nys, nyt, pyt, rad, ræd, ran, rec, ren, rex, rib, rim, rod, rof, rot, rum, run, ryn, sæd, sæl, sæm, sæp, sæs, sæt, sag, sah, sal, sar, sec, sel, sem, sib, sic, sid, sið, sin, sit, siþ, soð, sol, soþ, suð, sum, syb, syn, syx, syþ, tan, teð, teþ, tid, til, tin, tir, tor, tun, tyd, tyn, tyr, wac, wæf, wæg, wæl, wæn, wær, wæs, wæt, wag, wah, wan, was, wat, web, wed, weg, wel, wen, wep, wer, wes, wet, wic, wid, wið, wif, wig, win, wir, wis, wit, wiþ, woc, wod, woð, woh, wol, wom, won, wop, wyð, wyl, wyn, wyt, zeb, þæh, þæm, þær, þæs, þæt, þam, þan, þar, þas, þat, þec, þeh, þem, þer, þes, þet, þin, þis, þon, þus, þyð, þyn, þys

But the alternative shape, V-C-V, has far fewer instances. I count 51:

ace, ada, aða, ade, aðe, ado, æce, æna, æne, æni, æra, æse, æte, æwæ, aga, age, ana, ane, ara, are, awa, awo, eca, ece, eci, eðe, ege, ele, ely, esa, ete, eþe, iða, ige, ipe, oga, ore, oxa, uðe, una, ura, ure, uta, ute, utu, uþe, yða, yðe, yne, yþa, yþe

269 of the CVC forms overlap with the CVCV forms, suggesting that

  1. CVCV forms that overlap with CVC forms are inflections of the root, or
  2. they are coincidentally similar and represent two different lexemes.
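The overlap itself is a quick set computation. A sketch, assuming “overlap” means that a CVCV form minus its final vowel is an attested CVC form:

```python
def cvc_overlap(cvc_words, cvcv_words):
    """CVCV forms whose stem (the form minus its final vowel)
    is itself an attested CVC form."""
    cvc = set(cvc_words)
    return sorted(w for w in set(cvcv_words) if w[:-1] in cvc)
```

Running this over the two lists above is how a count like 269 can be checked.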

As I continue to refine my OE Parser, I wonder whether employing PIE root forms might be useful in identifying lexemes. Certainly, when I turn to programming James E. Cathey’s tremendous diachronic phonology of Germanic languages, root form/shape will play an essential role. One of the methods I wrote in python checks for root form/shape, and I hoped to use it to identify spelling variants—allowing variation only in root vowels of a form. So: C-V(1)-C, C-V(2)-C, … C-V(n)-C.
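That vowel-variation check can be sketched as follows; the idea is that two forms count as potential spelling variants when their consonant frames match and only the vowels differ (function names are mine):

```python
# Assumption: æ and y count as vowels, as elsewhere in these notes.
VOWELS = set('aæeiouy')

def consonant_frame(word):
    """Replace each vowel with a placeholder: 'ænig' -> '_n_g'."""
    return ''.join('_' if ch in VOWELS else ch for ch in word.lower())

def vowel_variants(word, wordlist):
    """Forms in the wordlist that differ from `word` only in their
    vowels, i.e. C-V(1)-C vs C-V(2)-C ... C-V(n)-C."""
    frame = consonant_frame(word)
    return [w for w in wordlist if w != word and consonant_frame(w) == frame]
```

For example, ænig and ænyg share the frame _n_g, so they surface as candidate spelling variants.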

Back to the crossword!

Update on a Parser-Tagger of Old English

Ottawa, Ontario
5 April 2019

Screen shot of Tagger

Method

Over the last eight months my approach to tagging an untagged sentence of Old English has been three-fold.
  1. First, I perform a simple look-up using four dictionaries (Bosworth-Toller, Clark-Hall, Bright’s glossary, a glossary of OE poetry from Texas), then save the results.
  2. Second and independently, I run the search-term through an inflectional search, returning the most likely part of speech based on suffixes and prefixes, then generate a weight based on whether that POS is closed-class or open-class. Those results are also saved.
  3. Third and finally, I check the search-term against a list of lemmata that I compiled by combining Dictionary of Old English lemmata with Bosworth-Toller lemmata. If the lemma is not found in the list, then I send the term to an inflectional search, take the returned inflectional category and generate all possible forms, then search the list for one of those forms; if a form matches an existing lemma, then I break; if not, I repeat with the next most likely part of speech. Those results are also saved.

After these three steps run independently on the target sentence, I compare all three sets of saved results and weigh them accordingly. No possibilities are omitted until syntactic information can be adduced.
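The comparison step might look like this sketch. The weights here are purely illustrative, not the ones my tagger actually uses:

```python
from collections import Counter

def combine(dict_hits, inflection_hits, lemma_hits,
            weights=(1.0, 0.8, 0.9)):
    """Merge three independent sets of POS guesses into one ranking.
    Each argument is a list of POS tags from one of the three steps;
    the weights are illustrative placeholders."""
    score = Counter()
    for tags, w in zip((dict_hits, inflection_hits, lemma_hits), weights):
        for tag in tags:
            score[tag] += w
    # Keep every possibility, ranked; nothing is discarded yet.
    return score.most_common()
```

The ranked list keeps all candidates alive until syntactic information can prune them.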

(Although I haven’t written it up, a search of the York-Toronto-Helsinki Parsed Corpus of Old English (YCOE) might be useful as a fourth approach: if the term is parsed in the YCOE, that information could add to the weight of likelihood.)

Taking three approaches and comparing the three sets of results yields about 85% accuracy.

Syntax

In order to improve the weights on the guesses, I’m writing a python class to guess syntactic patterns. I would like the class to examine the words without any information on their inflections or roots. The percentages here are not very good, but if you accumulate guesses, then accuracy improves (solving for theta). So, I look at
  1. the position of the term in a sentence. The percentages here are barely useful. If the term is in the first half of a prose sentence, then it is slightly more likely than not (51%) to be something other than a verb or adverb. If the term is in the second half, then it is slightly more likely than not to be a verb or adverb. These percentages were derived by parsing all sentences in the Corpus except those that derive from OE glosses on Latin, where the underlying Latin word-order skews the data.
  2. its relative position with respect to closed-class words. These percentages are a little more useful. For example, if the term follows a determiner, then it is more likely to be a noun or adjective than a verb.
  3. whether or not it is in a prepositional phrase and if so where. The word immediately following the preposition is likely either a noun or an adjective.
  4. whether or not it alliterates with other words in the sentence (OE alliteration tends to prioritize nouns, adjectives, and verbs).
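The first heuristic, sentence position, reduces to a few lines. The 51% split is the figure from my counts above; the two-way tag split is a simplification:

```python
def position_guess(index, sentence_length):
    """Guess based only on where the word sits in a prose sentence:
    the first half leans weakly away from verbs/adverbs (51/49),
    the second half leans weakly toward them."""
    if index < sentence_length / 2:
        return {'verb_or_adverb': 0.49, 'other': 0.51}
    return {'verb_or_adverb': 0.51, 'other': 0.49}
```

On its own this is nearly a coin flip; the point is that several weak guesses like this one accumulate into a usable weight.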

The point of this python class is to come to a judgment about the part of speech of a term without looking it up in a wordlist. So far, a class meant to identify prepositional phrases works fairly well—I still need to deal with compound objects.

Screen shot of tagger with Prepositional Phrases

 

You’ll notice in the screenshot above that the tagger returns prepositional phrases. If you know python, you can see that highly likely tags are returned as strings and less likely tags are returned in a list. This distinction in data types lets me anticipate the syntax parser with a type() query: if the tag is a list, ignore it. You’ll also notice that the tagger has mischaracterized the last word, gereste, as a noun or adjective. It is a verb.
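That filtering can be sketched like this (the word-to-tag dictionary is a stand-in for the tagger’s actual output):

```python
def confident_tags(tagged):
    """Keep only the high-confidence tags: strings are confident
    guesses, lists are less likely alternatives and are skipped."""
    return {word: tag for word, tag in tagged.items()
            if not isinstance(tag, list)}
```

So a result like `{'under': 'prep', 'gereste': ['noun', 'adj']}` is reduced to `{'under': 'prep'}` before the syntax parser sees it.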

Next ?

The last step is to merge the two sets of weights together and select the most likely part of speech for a word. Since the result is so data-rich, it allows a user to search for syntactic patterns as well as for words, bigrams, trigrams, formulae, and so forth.

So, a user could search for all adjectives that describe cyning ‘king’ or cwen ‘queen’. Or find all adjectives that describe both. Or all verbs used of queens. Or how many prepositional phrases mention clouds.

Bigrams

9 March 2019. Puzzling out a word jumble, I’m writing a python script to search a grid for words. Step one is to compile a list of legal bigrams in English. A bigram is a pair of adjacent letters. So the letter <Q> in English has a limited list of bigrams: we see <QU> as in quit, <QA> as in Qatar (and a few others if you allow very rare words).

I found a huge list online of English words compiled from web pages. 2.5 megs of text file! Here is the resulting python dict of bigrams:

{'A':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
'B':['A', 'B', 'C', 'D', 'E', 'F', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'G', 'P', 'Z', 'Q'],
'C':['A', 'I', 'K', 'T', 'U', 'E', 'O', 'Y', 'H', 'C', 'L', 'M', 'N', 'Q', 'R', 'S', 'D', 'B', 'W', 'Z', 'G', 'P', 'F'],
'D':['V', 'W', 'E', 'I', 'O', 'L', 'N', 'A', 'U', 'G', 'Y', 'R', 'P', 'C', 'D', 'F', 'H', 'J', 'M', 'S', 'T', 'Z', 'B', 'K', 'Q'],
'E':['H', 'R', 'D', 'N', 'E', 'S', 'M', 'Y', 'V', 'L', 'A', 'C', 'I', 'P', 'T', 'K', 'Z', 'U', 'G', 'W', 'B', 'F', 'O', 'X', 'Q', 'J'],
'F':['F', 'T', 'A', 'U', 'O', 'E', 'I', 'Y', 'L', 'G', 'R', 'S', 'W', 'Z', 'N', 'V', 'H', 'B', 'K', 'D', 'M', 'J', 'P', 'C'],
'G':['I', 'E', 'H', 'L', 'N', 'A', 'Y', 'O', 'R', 'M', 'U', 'S', 'D', 'G', 'K', 'P', 'B', 'W', 'T', 'F', 'C', 'V', 'J', 'Z'],
'H':['R', 'E', 'L', 'M', 'I', 'Y', 'O', 'U', 'A', 'T', 'N', 'S', 'W', 'B', 'P', 'Z', 'G', 'C', 'F', 'D', 'H', 'J', 'K', 'V', 'Q'],
'I':['C', 'T', 'N', 'S', 'O', 'E', 'A', 'Z', 'R', 'L', 'D', 'U', 'P', 'G', 'B', 'V', 'F', 'M', 'I', 'X', 'K', 'Y', 'W', 'H', 'Q', 'J'],
'J':['E', 'O', 'U', 'A', 'I', 'H', 'J', 'R', 'Y', 'P', 'D', 'M', 'W', 'L', 'T', 'N', 'B', 'K'],
'K':['A', 'H', 'E', 'I', 'Z', 'M', 'N', 'B', 'S', 'L', 'O', 'C', 'K', 'P', 'R', 'T', 'U', 'W', 'Y', 'D', 'F', 'G', 'J', 'V'],
'L':['F', 'L', 'U', 'I', 'O', 'E', 'Y', 'A', 'M', 'T', 'S', 'N', 'V', 'C', 'D', 'B', 'G', 'H', 'P', 'R', 'K', 'W', 'J', 'Q', 'Z', 'X'],
'M':['A', 'P', 'E', 'B', 'I', 'O', 'H', 'U', 'Y', 'M', 'S', 'T', 'F', 'L', 'W', 'N', 'R', 'C', 'G', 'V', 'K', 'D', 'J', 'Z', 'Q'],
'N':['I', 'A', 'C', 'E', 'D', 'T', 'U', 'O', 'S', 'R', 'G', 'Y', 'M', 'N', 'Z', 'L', 'P', 'K', 'F', 'H', 'Q', 'B', 'J', 'V', 'X', 'W', '-'],
'O':['L', 'N', 'R', 'S', 'I', 'M', 'T', 'U', 'G', 'O', 'W', 'A', 'B', 'D', 'H', 'V', 'X', 'C', 'K', 'Z', 'P', 'Y', 'E', 'F', 'Q', 'J'],
'P':['E', 'T', 'O', 'Y', 'I', 'H', 'S', 'R', 'A', 'N', 'U', 'L', 'P', 'M', 'J', 'B', 'D', 'F', 'W', 'K', 'C', 'G', 'V', 'Q'],
'Q':['U', 'I', 'A', 'R', 'E', 'O', 'Q'],
'R':['D', 'O', 'U', 'E', 'A', 'I', 'T', 'Y', 'R', 'S', 'V', 'M', 'B', 'P', 'G', 'N', 'H', 'L', 'F', 'W', 'C', 'K', 'J', 'Q', 'X', 'Z'],
'S':['C', 'T', 'A', 'E', 'S', 'I', 'G', 'H', 'K', 'O', 'M', 'U', 'F', 'Q', 'V', 'Y', 'P', 'L', 'N', 'B', 'W', 'D', 'R', 'J', 'Z'],
'T':['E', 'I', 'O', 'H', 'A', 'T', 'U', 'C', 'N', 'S', 'R', 'M', 'L', 'Y', 'B', 'P', 'F', 'W', 'K', 'Z', 'D', 'G', 'J', 'V', 'Q', 'X'],
'U':['A', 'S', 'L', 'R', 'C', 'M', 'N', 'D', 'T', 'E', 'V', 'P', 'Z', 'B', 'I', 'O', 'X', 'G', 'K', 'F', 'Y', 'W', 'J', 'H', 'Q', 'U'],
'V':['A', 'E', 'I', 'O', 'U', 'Y', 'S', 'R', 'C', 'L', 'V', 'N', 'Z', 'D', 'K', 'G'],
'W':['O', 'H', 'A', 'E', 'I', 'L', 'N', 'S', 'T', 'R', 'M', 'U', 'Y', 'B', 'P', 'W', 'D', 'F', 'K', 'C', 'G', 'Z', 'Q', 'V', 'J'],
'X':['I', 'A', 'Y', 'T', 'E', 'O', 'U', 'M', 'P', 'C', 'B', 'F', 'H', 'L', 'S', 'W', 'R', 'D', 'K', 'N', 'G', 'Q', 'Z', 'V'],
'Y':['S', 'M', 'A', 'R', 'C', 'P', 'G', 'I', 'L', 'N', 'D', 'T', 'X', 'O', 'E', 'Z', 'U', 'F', 'W', 'H', 'B', 'Y', 'K', 'V', 'J', 'Q'],
'Z':['E', 'A', 'U', 'Z', 'I', 'O', 'L', 'G', 'Y', 'R', 'H', 'T', 'N', 'B', 'D', 'P', 'K', 'C', 'M', 'V', 'S', 'F', 'W']
}

And here is the code to get the bigrams (my file of words is called web2.txt, and each word is on a separate line). In order to limit the bigrams to a list of unique letters, I use set().

import os

path = os.path.join(os.getcwd(), 'web2.txt')

with open(path, 'r') as allwords:
    words = allwords.read().split('\n')

# One set of following letters per initial letter; set() keeps them unique.
bigrams = {letter: set() for letter in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'}

for word in words:
    word = word.upper()
    # Walk every adjacent pair of letters, not just the first
    # occurrence of each letter (word.index() only finds the first).
    for first, second in zip(word, word[1:]):
        if first in bigrams:
            bigrams[first].add(second)

for letter in sorted(bigrams):
    print('\'{0}\':{1}, '.format(letter, sorted(bigrams[letter])))

Bigrams

March 6. An interim step in making a semantic map of Old English is producing bigrams. Here, bigrams are pairs of adjacent words. In order to build a social network of words, you need to know which words connect to one another. For example, in Beowulf, the word wolcnum ‘clouds’ almost always sits next to under ‘under’.

By comparison, the epic poem Judith has no clouds in it. And the homilist Ælfric never uses the phrase under wolcnum.
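Counting word bigrams is short work in python. A sketch, using plain whitespace tokenization (the real corpus needs better tokenization than this):

```python
from collections import Counter

def word_bigrams(text):
    """Count adjacent word pairs, e.g. ('under', 'wolcnum')."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))
```

A check like `word_bigrams(beowulf)[('under', 'wolcnum')]` then tells you how often the phrase occurs in a given text.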

Here is a screen shot of words that follow ic ‘I’ in the poem Beowulf. So, the first is “ic nah.”

You can see that there are 181 instances of ic, although only 80 are unique. In other words, some bigrams are repeated. The second word of the bigram is printed again in red, and passed to a part-of-speech tagger. The blue text is the tagger’s best guess, and it also returns the part-of-speech most cited by dictionaries. As I plan to discuss in an article, ic is very rarely followed by a verb.

We can discover a great deal about poetic style by looking very closely at the grammar of Old English poetry. The grammar is the unfolding in time of images and ideas and asides and so forth. Grammar describes how the words affect you in order as you read.

Poetic Words

sort | uniq

Has anyone done this since Angus Cameron suggested it in 1973? I separated the Corpus of Old English into genres and sub-genres, which enabled me to find words unique to poetry. The poetic texts are largely from the ASPR, but include Chronicle poems, the Meters of Boethius, and others.

First, I sorted the words into alphabetical order and removed duplicates. Second, I did the same for all prose texts. I also removed all foreign words from the prose texts—those are words that the Dictionary of Old English designated as foreign by placing them within <foreign> tags. Third, I compared prose words with poetic words. The resulting list is a set of all words used only in the poetic texts. Here is the file (right-click to download): PoeticWords
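The comparison itself reduces to a set difference. A sketch, assuming the word lists have already been cleaned and deduplicated:

```python
def unique_to_poetry(poetic_words, prose_words):
    """Words attested in the poetic texts but never in the prose.
    Both arguments are iterables of already-cleaned word forms."""
    return sorted(set(poetic_words) - set(prose_words))
```

This mirrors the sort | uniq pipeline above: sets deduplicate, the subtraction removes anything the prose shares, and sorted() restores alphabetical order.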

The next step is to classify each word by word class. That will allow me to differentiate verbal phrases from noun phrases in the poetry. Once noun phrases are isolated, I can begin to build a semantic map of poetic discourse in Old English. Afterwards, I’ll add verb phrases. So we’ll be able to know how OE poets described queens (adjectives) and what sort of acts queens performed (verbs), and compare that to descriptions of kings and the acts they performed. We can then further differentiate dryhten from cyning, and cwen from ides. But there’s a big caveat.

Because Old English poets wrote alliterative verse, adjectives and verbs may have been chosen simply on account of their initial sound. So, cwen may have attracted /k/-initial words. That is why it is essential to also build a map in prose of cwen. Since the formal structure of prose was not governed by alliteration (with the possible exception of Ælfric), the map in prose and the map in poetry of any given noun might well be distinct.

Fulbright Project

View from Dunton Tower at Carleton University looking north along the Rideau River towards the city of Ottawa.

I am very fortunate this year to have received a Fulbright award. The College of Humanities and Fine Arts at UMass made it possible for me to spend the academic year at Carleton University in Ottawa, Ontario, Canada. While here, I’m working on a natural-language parser of Old English, which I will use to create a semantic map of Old English nouns. In short, I want a computer to recognize an Old English noun and then find all words associated with it. Nouns are names for entities in the world. So a semantic map tells us something about how a language permits people to associate qualities with entities.

Following in the footsteps of Artificial Intelligence researchers like Waleed Ammar of the Allen Institute for AI, I will be using untagged corpora—that is, texts that no one has marked up for grammatical information. I would like to interfere with the data as little as possible.

What makes this project different from similar NLP projects is my aim. I want to produce a tool that can be used by literary critics. I am not interested in improving Siri or Alexa or a pop-up advertisement that wants to sell you shoes. Neither is my aim to propose hypotheses about natural languages, which is a general aim of linguistics-related NLPs. So, the object of my inquiry is artful writing, consciously patterned language.

STAGE ONE

The first stage is to write a standard NLP parser using tagged corpora. The standard parser will serve to check any results of the non-standard parser. Thanks to the generosity of Dr. Ondrej Tichý of Charles University in Prague, the standard parser is now equipped with a list of OE lexemes, parsed for form. A second control mechanism is the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a tagged corpus that includes most of Ælfric’s Catholic Homilies.

STAGE TWO

At the same time, I divided the OE corpus into genres. In poetic texts, dragons can breathe fire. But in non-fictional texts, dragons don’t exist. So a semantic field drawn around dragons will change depending on genre. I am subdividing the poetry according to codex, and then according to age (as far as is possible) to account for semantic shift. Those subdivisions will have to be revised, then abandoned as the AI engine gets running. (I’ll be using the python module fastai to implement an AI.)

Notes

Unicode. You’d think that searching a text string for another text string would be straightforward. But nothing is easy! A big part of preparing the Old English Corpus for manipulation is ensuring that the bits and bytes are in the right order. I had a great deal of difficulty opening the Bosworth-Toller structured data. It was in UTF-16, the character encoding commonly used on Windows machines. When I tried to open it via python, the interpreter threw an error. It turns out, Unicode is far more complex than one imagines. Although I can find ð and þ, for example, I cannot find them as the first characters of words after a newline (or even regex \b). Another hurdle.
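For reference, python reads UTF-16 cleanly once the encoding is passed explicitly; a sketch:

```python
# A UTF-16 file usually begins with a byte-order mark (BOM); python's
# 'utf-16' codec consumes it when you name the encoding explicitly,
# instead of raising a decode error under the default codec.
def read_utf16(path):
    with open(path, 'r', encoding='utf-16') as f:
        return f.read()

def decode_utf16(raw):
    """The same operation on raw bytes already in memory."""
    return raw.decode('utf-16')
```

This gets the file open; it does not by itself solve the search problems with ð and þ described above.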

Overcame it! The problem was in the data. For reasons unknown to me, Microsoft Windows encodes runic characters differently than expected. So the solution was to use a text editor (BBEdit), go into the data, and replace all original thorns with regular thorns. Same for eth, asc, and so forth. Weirdly, it didn’t look like I was doing anything: both thorns looked identical on the screen.
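Unicode normalization can fold some visually identical variants (composed vs. decomposed character sequences) into one canonical form, though it will not help when the file uses genuinely different codepoints. A sketch:

```python
import unicodedata

def canonical(text):
    """Fold composed/decomposed variants (e.g. a letter built from a
    base character plus a combining mark) into one canonical NFC form
    before searching."""
    return unicodedata.normalize('NFC', text)
```

For example, 'a' followed by a combining acute accent normalizes to the single precomposed character 'á', so a search written one way finds text encoded the other way.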

 

Screen shot of parser guts so far. Markup is data from Tichý’s Bosworth-Toller. Inflect receives Markup, then returns inflections based on the gender of strong nouns. Variants and special cases have not yet been included.

To finish STAGE ONE, I’ll now inflect every noun, pronoun, and adjective, then conjugate every verb as they come in (on the fly). Ondrej Tichý at Charles University in Prague, who very generously sent me his structured data, took a slightly different approach: he generated all the permutations first and placed them into a list. Finally, as a sentence comes in, I’ll send off each word to the parser, receive its markup and inflections/conjugations, then search the markup for matches.
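On-the-fly inflection can be sketched for the simplest paradigm, the strong masculine a-stem (e.g. stan ‘stone’); the endings are the textbook ones, and the function name is mine:

```python
# Strong masculine a-stem endings; variants and special cases omitted,
# as in the parser itself at this stage.
A_STEM_ENDINGS = {
    'nom.sg': '', 'acc.sg': '', 'gen.sg': 'es', 'dat.sg': 'e',
    'nom.pl': 'as', 'acc.pl': 'as', 'gen.pl': 'a', 'dat.pl': 'um',
}

def inflect_a_stem(stem):
    """Generate all inflected forms of a strong masculine a-stem noun."""
    return {case: stem + ending for case, ending in A_STEM_ENDINGS.items()}
```

Generating forms per request like this trades Tichý’s one-time cost of listing every permutation for a small cost at lookup time.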