Old English word shape

Crosswords are intriguing programming problems. How do you generate a New-York-Times-style crossword puzzle from a list of words? While attempting one (in Old English, of course), I noticed some very interesting features of Old English words.

Consider the upper-left section of the crossword puzzle. An easy way to generate candidates is to map words to their phonological shapes, then search a word list for instances of a target shape. The simplest shape alternates consonants (C) and vowels (V). If 1-across is C-V-C-V, then the next word down, 13-across, is V-C-V-C. Then you search the list of OE 4-letter words for that shape.
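In python, the shape map is only a few lines. A minimal sketch, assuming a simple OE vowel inventory (a, æ, e, i, o, u, y) and treating everything else, including þ and ð, as a consonant:

OE_VOWELS = set('aæeiouy')

def shape(word):
    """Map a word to its C/V skeleton, e.g. 'bana' -> 'CVCV'."""
    return ''.join('V' if ch in OE_VOWELS else 'C' for ch in word.lower())

def words_of_shape(words, target):
    """Filter a word list down to a single skeleton."""
    return [w for w in words if shape(w) == target]

# words_of_shape(['bana', 'eþel', 'bæd'], 'VCVC') -> ['eþel']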

Say 1-across is C-V-C-V, bana ‘murderer’. 13-across might be V-C-V-C, eþel ‘homeland’. That sets up 1-down to start with be– and 2-down to start with aþ–. You’d think there would be plenty of words to fit that scheme.

But there are not! After extracting all words from the poetic corpus of Old English, I divided them into 3-, 4-, 5-, 6-, and 7-letter word lists. There are 133 tokens (words rather than lexemes) with the shape V-C-V-C:

abæd, abal, aban, aber, abit, acas, acol, acul, acyr, adam, adan, aðas, ades, adon, aðum, æcer, æðel, æfen, æfyn, ægen, æled, æleð, ænig, ænyg, æren, æres, æror, ærur, ætan, æten, ætes, æton, ætyw, æxum, afyr, agæf, agan, agar, agef, agen, agif, agof, agol, agon, agun, ahef, ahof, ahon, alæd, alæg, alæt, ales, alyf, alys, amæt, amen, amet, amor, anes, anum, arað, aras, ares, arim, aris, arod, arum, atol, aweg, awer, awoc, awox, axan, aþas, ecan, eces, ecum, eðan, eðel, edom, efen, enoc, enos, eror, etan, eteð, evan, eþel, ican, iceð, idel, ides, iren, isac, isen, isig, oðer, ofæt, ofen, ofer, ofet, ofir, ofor, onet, open, oreb, oroð, oruð, ower, oxan, oþer, ufan, ufon, ufor, upon, ures, urum, user, usic, utan, uten, uton, ycað, ycan, yced, yðum, yfel, ytum, ywan, ywaþ, ywed, yweð, yþum

And words with the shape C-V-C-V number 484:

bacu, baða, bæce, bæle, bære, bana, bare, baru, baþu, bega, bena, bene, bera, bere, bete, beþe, bide, bite, boca, boda, body, boga, bona, bote, bure, buta, bute, butu, byge, byme, byre, cafe, care, cele, cene, cepa, ciða, cile, come, cuðe, cuma, cume, cuþe, cyle, cyme, cymu, cyre, cyþe, dæda, dæde, dæge, dæle, dæne, ðæra, ðære, daga, dalu, ðane, ðara, dare, dege, dema, demæ, deme, dena, dene, ðere, ðine, dole, doma, dome, ðone, duna, dune, dura, dure, duru, dyde, ðyle, dyne, dyre, faca, fæce, fæge, fæla, fæle, fære, fana, fane, fara, fare, feða, feðe, fela, fele, fere, feþa, feþe, fife, fira, fire, five, fore, fota, fote, fula, fule, fuse, fyra, fyre, gara, gare, gatu, gedo, gena, geno, gere, geta, gife, gifu, gina, goda, gode, godu, gota, guðe, guma, gume, gute, guþe, gyfe, gyme, gyta, gyte, hada, hade, hæle, hælo, hælu, hæse, hæto, hafa, hafo, hafu, hale, hali, hama, hame, hara, hare, hata, hate, hefe, hege, hele, helo, here, hete, hewe, hige, hina, hine, hira, hire, hiwa, hiwe, hofe, hofu, hole, hopa, hope, horu, huðe, huga, huna, huru, husa, huse, huþa, huþe, hyde, hyðe, hyge, hyne, hyra, hyre, hyse, lace, laða, lade, laðe, læce, læde, læla, læne, lænu, lære, læte, lafe, lage, lago, lagu, lama, lame, lara, lare, lata, late, latu, laþe, lefe, lega, lege, lene, lete, lica, lice, lida, liða, lide, liðe, life, lige, lima, lime, liþe, liþu, locu, lofe, lufa, lufæ, lufe, lufu, lyfe, lyge, lyre, mæca, mæða, mæga, mæge, mæla, mæle, mæne, mæra, mære, mæro, mæru, mæte, maga, mage, mago, magu, mana, mane, mara, mare, meca, mece, meda, mede, meðe, medo, medu, melo, mere, mete, meþe, mide, mila, mine, moda, mode, modi, mona, more, mose, mote, muðe, muþa, muþe, myne, naca, næle, næni, nære, næte, nama, name, nane, neda, nede, nefa, nele, niða, niðe, nime, nine, niwe, niþa, niþe, noma, nose, noþe, nyde, nyle, race, racu, rade, raðe, ræda, ræde, raþe, rece, reða, reðe, rene, reþe, rica, rice, ricu, ride, rime, ripa, ripe, rode, rofe, rome, rope, rowe, rume, runa, rune, ryha, ryne, sace, sacu, sade, sæce, sæda, sæde, sæge, sæla, sæle, sæne, saga, sale, salo, salu, same, sara, saræ, sare, sari, sece, seðe, sefa, sege, sele, seme, sene, sete, sida, siða, side, siðe, sido, sige, sile, sina, sine, site, siþa, siþe, soða, soðe, some, sona, sone, soþa, soþe, sume, suna, suno, sunu, syle, syne, synu, sype, syre, syþe, tæle, tæso, tala, tale, tame, tane, tela, tene, tida, tiða, tide, tiðe, tila, tile, tima, tire, toða, tome, toþe, tuge, tyne, waca, wace, wada, wade, waðe, wado, wadu, waðu, wæda, wæde, wædo, wædu, wæge, wæle, wæra, wære, wæta, wage, wala, wale, walo, walu, wana, ware, waru, wega, wege, wela, wena, wene, wepe, wera, were, wese, wica, wida, wide, widu, wifa, wife, wiga, wige, wile, wina, wine, wire, wisa, wise, wita, wite, witu, woða, woma, wope, wora, woþa, wuda, wudu, wule, wuna, wyle, þæce, þæne, þæra, þære, þane, þara, þare, þine, þire, þone, þyle, þyre

Notice how many end with dative singular markers like –e. It suggests that we are looking at inflected forms of a C-V-C shape, one of the most common Proto-Indo-European root forms. Again in the poetic corpus, there are 350 C-V-C tokens:

bad, bæc, bæd, bæð, bæg, bæl, bæm, bær, bam, ban, bat, bec, bed, beg, ben, bet, bid, bið, bil, bit, biþ, boc, boð, boh, bot, bur, byð, byþ, cam, can, cen, cer, ces, cið, col, com, con, cuð, cum, cuþ, cyð, cym, cyn, dæd, dæg, dæl, ðæm, ðær, ðæs, ðæt, ðah, ðam, ðan, ðar, ðas, day, ðec, deð, ðeð, ðeh, dem, ðem, ðer, ðes, ðet, deþ, dim, ðin, ðis, doð, dol, dom, don, ðon, dor, doþ, dun, ðus, dyn, ðyn, ðys, fæc, fær, fæt, fag, fah, fam, fan, far, fed, fel, fen, fet, fex, fif, fin, foh, fon, for, fot, ful, fus, fyl, fyr, fys, gad, gað, gæd, gæð, gæþ, gal, gan, gar, ged, gem, gen, get, gid, gif, gim, gin, git, god, guð, gyd, gyf, gyt, had, hæl, hær, hæs, hal, ham, har, hat, heg, heh, hel, her, het, hig, him, his, hit, hiw, hof, hoh, hol, hun, hus, hyd, hyð, hys, hyt, lac, lad, lað, læd, læf, læg, læn, lær, læs, læt, laf, lah, lar, laþ, lef, leg, len, let, lic, lid, lið, lif, lig, lim, lit, liþ, loc, lof, log, lot, lyt, mað, mæg, mæl, mæn, mæt, mæw, man, mec, men, mid, mið, min, mit, mod, mon, mor, mos, mot, muð, muþ, næs, nah, nam, nan, nap, nas, nat, neb, ned, neh, nes, nið, nim, nis, niþ, nom, non, num, nyd, nys, nyt, pyt, rad, ræd, ran, rec, ren, rex, rib, rim, rod, rof, rot, rum, run, ryn, sæd, sæl, sæm, sæp, sæs, sæt, sag, sah, sal, sar, sec, sel, sem, sib, sic, sid, sið, sin, sit, siþ, soð, sol, soþ, suð, sum, syb, syn, syx, syþ, tan, teð, teþ, tid, til, tin, tir, tor, tun, tyd, tyn, tyr, wac, wæf, wæg, wæl, wæn, wær, wæs, wæt, wag, wah, wan, was, wat, web, wed, weg, wel, wen, wep, wer, wes, wet, wic, wid, wið, wif, wig, win, wir, wis, wit, wiþ, woc, wod, woð, woh, wol, wom, won, wop, wyð, wyl, wyn, wyt, zeb, þæh, þæm, þær, þæs, þæt, þam, þan, þar, þas, þat, þec, þeh, þem, þer, þes, þet, þin, þis, þon, þus, þyð, þyn, þys

But the alternative shape, V-C-V, has far fewer instances. I count 51:

ace, ada, aða, ade, aðe, ado, æce, æna, æne, æni, æra, æse, æte, æwæ, aga, age, ana, ane, ara, are, awa, awo, eca, ece, eci, eðe, ege, ele, ely, esa, ete, eþe, iða, ige, ipe, oga, ore, oxa, uðe, una, ura, ure, uta, ute, utu, uþe, yða, yðe, yne, yþa, yþe

269 of the CVC forms overlap with the CVCV forms, that is, the CVC form supplies the first three letters of a CVCV form (a quick check appears after this list), suggesting that

  1. the CVCV forms that overlap with CVC forms are inflections of the root, or
  2. they are coincidentally similar and represent two different lexemes.
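The overlap itself is quick to check. A sketch, reusing the shape idea from above (the filename is a stand-in for my token list):

# The full token list, one word per line (filename illustrative).
tokens = open('oe_poetic_tokens.txt').read().split()

OE_VOWELS = set('aæeiouy')  # same assumed vowel inventory as above

def shape(word):
    return ''.join('V' if ch in OE_VOWELS else 'C' for ch in word.lower())

cvc = {w for w in tokens if shape(w) == 'CVC'}
cvcv = {w for w in tokens if shape(w) == 'CVCV'}

# A CVC form overlaps when it supplies the first three letters of a CVCV form.
overlap = {w for w in cvc if any(x.startswith(w) for x in cvcv)}
print(len(overlap))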

As I continue to refine my OE Parser, I wonder whether employing PIE root forms might be useful in identifying lexemes. Certainly, when I turn to programming James E. Cathey’s tremendous diachronic phonology of Germanic languages, root form/shape will play an essential role. One of the methods I wrote in python checks for root form/shape, and I hoped to use it to identify spelling variants—allowing variation only in root vowels of a form. So: C-V(1)-C, C-V(2)-C, … C-V(n)-C.
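The variant check can be sketched with a regular expression that fixes the consonant frame and leaves the vowel slot free (same assumed vowel set as above):

import re

OE_VOWELS = 'aæeiouy'

def variant_pattern(form):
    """Turn a form into a pattern with free vowel slots: 'sæl' -> 's[aæeiouy]l'."""
    body = ''.join('[' + OE_VOWELS + ']' if ch in OE_VOWELS else re.escape(ch)
                   for ch in form.lower())
    return re.compile(body)

def spelling_variants(form, words):
    """Return the words that match the form except in their vowels."""
    pat = variant_pattern(form)
    return [w for w in words if pat.fullmatch(w)]

# spelling_variants('sæl', ['sal', 'sel', 'sol', 'sæt']) -> ['sal', 'sel', 'sol']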

Back to the crossword!

Update on a Parser-Tagger of Old English

Ottawa, Ontario
5 April 2019

Screen shot of Tagger

Method

Over the last eight months my approach to tagging an untagged sentence of Old English has been three-fold.
  1. First, I perform a simple look-up using four dictionaries (Bosworth-Toller, Clark-Hall, Bright’s glossary, a glossary of OE poetry from Texas), then save the results.
  2. Second and independently, I run the search-term through an inflectional search, returning the most likely part of speech based on suffixes and prefixes, then generate a weight based on whether that POS is closed-class or open-class. Those results are also saved.
  3. Third and finally, I check the search-term against a list of lemmata that I compiled by combining Dictionary of Old English lemmata and Bosworth-Toller lemmata. If the lemma is not found in the list, then I send it to an inflectional search, take the returned inflectional category, and generate all possible forms, then search the list for one of those forms; if a form matches an existing lemma, I stop; if not, I repeat with the next most likely part of speech. Those results are also saved.

After these three steps run independently on the target sentence, I compare all three sets of saved results and weigh them accordingly. No possibilities are omitted until syntactic information can be adduced.
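A minimal sketch of that comparison; the POS labels and weights are illustrative placeholders, not the tagger’s actual values:

from collections import defaultdict

def combine(dictionary_hits, inflection_hits, lemma_hits):
    """Each argument maps a candidate POS to a weight from one of the three
    steps; sum them without discarding any possibility."""
    combined = defaultdict(float)
    for result in (dictionary_hits, inflection_hits, lemma_hits):
        for pos, weight in result.items():
            combined[pos] += weight
    # Rank the candidates but keep them all until syntax can adjudicate.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# combine({'noun': 1.0}, {'noun': 0.6, 'verb': 0.4}, {'noun': 0.8})
# -> [('noun', 2.4), ('verb', 0.4)]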

(Although I haven’t written it up, a search of the Helsinki Corpus might be useful as a fourth approach: if the term is parsed in the YCOE, that information could add to the weight of likelihood.)

Taking three approaches and comparing three sets of results is about 85% accurate.

Syntax

In order to improve the weights on the guesses, I’m writing a python class to guess syntactic patterns. I would like the class to examine the words without any information on their inflections or roots. The percentages here are not very good, but if you accumulate guesses, then accuracy improves (solving for theta); a sketch of that accumulation follows this list. So, I look at
  1. the position of the term in a sentence. The percentages here are barely useful. If the term is in the first half of a prose sentence, then it is more likely than not (51%) to be something other than a verb or adverb. If the term is in the second half of the sentence, then it is more likely than not to be a verb or adverb. These percentages are discovered by parsing all sentences in the Corpus except those that derive from OE glosses on Latin—where underlying Latin word-order corrupts the data.
  2. its relative position with respect to closed-class words. These percentages are a little more useful.  For example, if the term follows a determiner, then it is more likely to be a noun or adjective than to be a verb.
  3. whether or not it is in a prepositional phrase and if so where. The word immediately following the preposition is likely either a noun or an adjective.
  4. whether or not it alliterates with other words in the sentence (OE alliteration tends to prioritize nouns, adjectives, and verbs).

The point of this python class is to come to a judgment about the part of speech of a term without looking it up in a wordlist. So far, a class meant to identify prepositional phrases works fairly well—I still need to deal with compound objects.

Screen shot of tagger with Prepositional Phrases


You’ll notice in the screenshot above that the tagger returns prepositional phrases. If you know python, you can see that highly likely tags are returned as strings and that less likely tags are returned in a list. This distinction in data types allows me to anticipate the syntax parser with a type() query. If type() == list, then ignore. You’ll notice that it has mischaracterized the last word, gereste, as a noun or adjective. It is a verb.
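The test itself is one line (isinstance is the more idiomatic spelling of the type() query; the names here are illustrative):

def confident(tag_result):
    """A string means one confident tag; a list means weaker candidates
    that the syntax parser should ignore for now."""
    return not isinstance(tag_result, list)

print(confident('NOUN'))           # True: a highly likely tag
print(confident(['NOUN', 'ADJ']))  # False: less likely candidates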

Next?

The last step is to merge the two sets of weights together and select the most likely part of speech for a word. Since the result is so data-rich, it allows a user to search for syntactic patterns as well as for words, bigrams, trigrams, formulae, and so forth.

So, a user could search for all adjectives that describe cyning ‘king’ or cwen ‘queen’. Or find all adjectives that describe both. Or all verbs used of queens. Or how many prepositional phrases mention clouds.

Bigrams

9 March 2019. Puzzling out a word jumble, I’m writing a python script to search a grid for words. Step one is to compile a list of legal bigrams in English. Bigrams are two letters that go side-by-side. So the letter <Q> in English has a limited list of bigrams. We see <QU> as in quit, <QA> as in Qatar (and a few others if you allow very rare words).

I found a huge list online of English words compiled from web pages. 2.5 megs of text file! Here is the resulting python dict of bigrams:

{'A':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
'B':['A', 'B', 'C', 'D', 'E', 'F', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'G', 'P', 'Z', 'Q'],
'C':['A', 'I', 'K', 'T', 'U', 'E', 'O', 'Y', 'H', 'C', 'L', 'M', 'N', 'Q', 'R', 'S', 'D', 'B', 'W', 'Z', 'G', 'P', 'F'],
'D':['V', 'W', 'E', 'I', 'O', 'L', 'N', 'A', 'U', 'G', 'Y', 'R', 'P', 'C', 'D', 'F', 'H', 'J', 'M', 'S', 'T', 'Z', 'B', 'K', 'Q'],
'E':['H', 'R', 'D', 'N', 'E', 'S', 'M', 'Y', 'V', 'L', 'A', 'C', 'I', 'P', 'T', 'K', 'Z', 'U', 'G', 'W', 'B', 'F', 'O', 'X', 'Q', 'J'],
'F':['F', 'T', 'A', 'U', 'O', 'E', 'I', 'Y', 'L', 'G', 'R', 'S', 'W', 'Z', 'N', 'V', 'H', 'B', 'K', 'D', 'M', 'J', 'P', 'C'],
'G':['I', 'E', 'H', 'L', 'N', 'A', 'Y', 'O', 'R', 'M', 'U', 'S', 'D', 'G', 'K', 'P', 'B', 'W', 'T', 'F', 'C', 'V', 'J', 'Z'],
'H':['R', 'E', 'L', 'M', 'I', 'Y', 'O', 'U', 'A', 'T', 'N', 'S', 'W', 'B', 'P', 'Z', 'G', 'C', 'F', 'D', 'H', 'J', 'K', 'V', 'Q'],
'I':['C', 'T', 'N', 'S', 'O', 'E', 'A', 'Z', 'R', 'L', 'D', 'U', 'P', 'G', 'B', 'V', 'F', 'M', 'I', 'X', 'K', 'Y', 'W', 'H', 'Q', 'J'],
'J':['E', 'O', 'U', 'A', 'I', 'H', 'J', 'R', 'Y', 'P', 'D', 'M', 'W', 'L', 'T', 'N', 'B', 'K'],
'K':['A', 'H', 'E', 'I', 'Z', 'M', 'N', 'B', 'S', 'L', 'O', 'C', 'K', 'P', 'R', 'T', 'U', 'W', 'Y', 'D', 'F', 'G', 'J', 'V'],
'L':['F', 'L', 'U', 'I', 'O', 'E', 'Y', 'A', 'M', 'T', 'S', 'N', 'V', 'C', 'D', 'B', 'G', 'H', 'P', 'R', 'K', 'W', 'J', 'Q', 'Z', 'X'],
'M':['A', 'P', 'E', 'B', 'I', 'O', 'H', 'U', 'Y', 'M', 'S', 'T', 'F', 'L', 'W', 'N', 'R', 'C', 'G', 'V', 'K', 'D', 'J', 'Z', 'Q'],
'N':['I', 'A', 'C', 'E', 'D', 'T', 'U', 'O', 'S', 'R', 'G', 'Y', 'M', 'N', 'Z', 'L', 'P', 'K', 'F', 'H', 'Q', 'B', 'J', 'V', 'X', 'W', '-'],
'O':['L', 'N', 'R', 'S', 'I', 'M', 'T', 'U', 'G', 'O', 'W', 'A', 'B', 'D', 'H', 'V', 'X', 'C', 'K', 'Z', 'P', 'Y', 'E', 'F', 'Q', 'J'],
'P':['E', 'T', 'O', 'Y', 'I', 'H', 'S', 'R', 'A', 'N', 'U', 'L', 'P', 'M', 'J', 'B', 'D', 'F', 'W', 'K', 'C', 'G', 'V', 'Q'],
'Q':['U', 'I', 'A', 'R', 'E', 'O', 'Q'],
'R':['D', 'O', 'U', 'E', 'A', 'I', 'T', 'Y', 'R', 'S', 'V', 'M', 'B', 'P', 'G', 'N', 'H', 'L', 'F', 'W', 'C', 'K', 'J', 'Q', 'X', 'Z'],
'S':['C', 'T', 'A', 'E', 'S', 'I', 'G', 'H', 'K', 'O', 'M', 'U', 'F', 'Q', 'V', 'Y', 'P', 'L', 'N', 'B', 'W', 'D', 'R', 'J', 'Z'],
'T':['E', 'I', 'O', 'H', 'A', 'T', 'U', 'C', 'N', 'S', 'R', 'M', 'L', 'Y', 'B', 'P', 'F', 'W', 'K', 'Z', 'D', 'G', 'J', 'V', 'Q', 'X'],
'U':['A', 'S', 'L', 'R', 'C', 'M', 'N', 'D', 'T', 'E', 'V', 'P', 'Z', 'B', 'I', 'O', 'X', 'G', 'K', 'F', 'Y', 'W', 'J', 'H', 'Q', 'U'],
'V':['A', 'E', 'I', 'O', 'U', 'Y', 'S', 'R', 'C', 'L', 'V', 'N', 'Z', 'D', 'K', 'G'],
'W':['O', 'H', 'A', 'E', 'I', 'L', 'N', 'S', 'T', 'R', 'M', 'U', 'Y', 'B', 'P', 'W', 'D', 'F', 'K', 'C', 'G', 'Z', 'Q', 'V', 'J'],
'X':['I', 'A', 'Y', 'T', 'E', 'O', 'U', 'M', 'P', 'C', 'B', 'F', 'H', 'L', 'S', 'W', 'R', 'D', 'K', 'N', 'G', 'Q', 'Z', 'V'],
'Y':['S', 'M', 'A', 'R', 'C', 'P', 'G', 'I', 'L', 'N', 'D', 'T', 'X', 'O', 'E', 'Z', 'U', 'F', 'W', 'H', 'B', 'Y', 'K', 'V', 'J', 'Q'],
'Z':['E', 'A', 'U', 'Z', 'I', 'O', 'L', 'G', 'Y', 'R', 'H', 'T', 'N', 'B', 'D', 'P', 'K', 'C', 'M', 'V', 'S', 'F', 'W']
}

And here is the code to get the bigrams (my file of words is called web2.txt, and each word is on a separate line). To keep each list of following letters unique, I check membership before appending.

import os

# Path to the word list: one word per line.
path = os.path.join(os.getcwd(), 'web2.txt')

# One list of attested following letters for each capital letter.
bigrams = {letter: [] for letter in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'}

with open(path, 'r') as allwords:
    words = allwords.read().split('\n')

for word in words:
    word = word.upper()
    # Walk every adjacent pair of letters so that every occurrence counts,
    # not just the first appearance of a letter in the word.
    for first, second in zip(word, word[1:]):
        if first in bigrams and second not in bigrams[first]:
            bigrams[first].append(second)

for letter in bigrams:
    print('\'{0}\':{1}, '.format(letter, bigrams[letter]))

Bigrams

March 6. An interim step in making a semantic map of Old English is producing bigrams. Bigrams, in this case, are pairs of words rather than letters. In order to build a social network of words, you need to know which words connect to one another. For example, in Beowulf, the word wolcnum ‘clouds’ almost always sits next to under ‘under’.

By comparison, the epic poem Judith has no clouds in it. And the homilist Ælfric never uses the phrase under wolcnum.
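Producing word bigrams takes only a few lines. A minimal sketch, with deliberately naive tokenizing (the sample line is from Beowulf):

from collections import Counter

def word_bigrams(text):
    """Count adjacent word pairs in a lower-cased, whitespace-split text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

counts = word_bigrams('weox under wolcnum weorðmyndum þah')
print(counts[('under', 'wolcnum')])  # 1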

Here is a screen shot of words that follow ic ‘I’ in the poem Beowulf. So, the first is “ic nah.”

You can see that there are 181 instances of ic, although only 80 are unique. In other words, some bigrams are repeated. The second word of the bigram is printed again in red, and passed to a part-of-speech tagger. The blue text is the tagger’s best guess, and it also returns the part-of-speech most cited by dictionaries. As I plan to discuss in an article, ic is very rarely followed by a verb.

We can discover a great deal about poetic style by looking very closely at the grammar of Old English poetry. The grammar is the unfolding in time of images and ideas and asides and so forth. Grammar describes how the words affect you in order as you read.

About three-quarters there

Screen shot 12/2/2018. You’re only as good as your data

That is the lesson here. Single brackets [x] indicate an entry in Ondrej Tichy’s Bosworth-Toller, which I edited into a json file. Double brackets [[x]] indicate an entry in the raw data of Ondrej’s BT, if the word wasn’t found in the json file. Empty brackets indicate no returned value. A word like mæg can mean ‘may’ (V) or ‘kin’ (N). The word didn’t make the structured data, and the raw data mischaracterized it in its verbal form, so the parser didn’t pick up the verb.

Rather than spend days improving the data from Bosworth-Toller, or overwhelm the servers in Prague with BeautifulSoup requests, I’m going to scrape word lists from Old English sites, and OCR some glossaries from freely-available books. If I can compile 10 or 20 word lists and zip them to grammatical information, I can get a percentage of likelihood for any given word. Second, I can use the York-Helsinki Parsed Corpus of Aelfric’s prose through CLTK. It won’t catch all of the words, but might be a help.

I’ve written a simple script to inflect any noun or adjective and to conjugate any verb. I can work it backwards to find the root form of a word, then send that to BT.
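A toy version of the inflector, assuming a bare strong-masculine paradigm with no stem changes, variants, or special cases:

# Endings for a simplified strong masculine noun.
STRONG_MASC = {'nom.sg': '', 'acc.sg': '', 'gen.sg': 'es', 'dat.sg': 'e',
               'nom.pl': 'as', 'acc.pl': 'as', 'gen.pl': 'a', 'dat.pl': 'um'}

def inflect(root):
    """Generate every case form of a root."""
    return {case: root + ending for case, ending in STRONG_MASC.items()}

def deflect(form):
    """Work backwards: strip each known ending to propose candidate roots,
    which can then be sent to BT for confirmation."""
    candidates = {form}  # the bare form may itself be nom./acc. singular
    for ending in set(STRONG_MASC.values()) - {''}:
        if form.endswith(ending):
            candidates.add(form[:-len(ending)])
    return candidates

# inflect('cyning')['dat.pl'] -> 'cyningum'
# deflect('cyningum')         -> {'cyningum', 'cyning'}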

Final step is to run the words and forms through a syntactic parser. If it sees ne, which carries a weight of 5, then it increases the likelihood that the next word is a verb, since negative particles almost always sit next to verbs in OE. (One can check that with a bigram search.) Similar proximity searches to prepositions, pronouns, and so forth help to assess weights (probabilities).

Once this next layer is completed, and the weights adjusted, I will have a decent control to check the more experimental parser.

Poetic Words

sort | uniq

Has anyone done this since Angus Cameron suggested it in 1973? I separated the Corpus of Old English into genres and sub-genres. It enabled me to find words unique to poetry. The poetic texts are largely from the ASPR, but include Chronicle poems, the Meters of Boethius, and others.

First, I sorted the words into alphabetical order and removed duplicates. Second, I did the same for all prose texts. I also removed all foreign words from the prose texts—those are words that the Dictionary of Old English designated as foreign by placing them within <foreign> tags. Third, I compared prose words with poetic words. The resulting list is a set of all words used only in the poetic texts. Here is the file (right-click to download): PoeticWords
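The comparison reduces to a set difference. A sketch, with stand-in filenames:

# Read one word per line from each deduplicated list (hypothetical filenames).
with open('poetry_words.txt') as f:
    poetic = set(f.read().split())
with open('prose_words.txt') as f:
    prose = set(f.read().split())

# Words attested only in the poetic texts.
poetic_only = sorted(poetic - prose)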

The next step is to classify each word by word class. That will allow me to differentiate verbal phrases from noun phrases in the poetry. Once noun phrases are isolated, I can begin to build a semantic map of poetic discourse in Old English. Afterwards, I’ll add verb phrases. So we’ll be able to know how OE poets described queens (adjectives) and what sort of acts queens performed (verbs), and compare that to descriptions of kings and the acts they performed. We can then further differentiate dryhten from cyning, and cwen from ides. But there’s a big caveat.

Because Old English poets wrote alliterative verse, adjectives and verbs may have been chosen simply on account of their initial sound. So, cwen may have attracted /k/-initial words. That is why it is essential to also build a map in prose of cwen. Since the formal structure of prose was not governed by alliteration (with the possible exception of Ælfric), the map in prose and the map in poetry of any given noun might well be distinct.

Fulbright Project

View from Dunton Tower at Carleton University looking north along the Rideau River towards the city of Ottawa.

I am very fortunate this year to have received a Fulbright award. The College of Humanities and Fine Arts at UMass made it possible for me to spend the academic year at Carleton University in Ottawa, Ontario, Canada. While here, I’m working on a natural-language parser of Old English, which I will use to create a semantic map of Old English nouns. In short, I want a computer to recognize an Old English noun and then find all words associated with it. Nouns are names for entities in the world. So a semantic map tells us something about how a language permits people to associate qualities with entities.

Following in the footsteps of Artificial Intelligence researchers like Waleed Ammar of the Allen Institute for AI, I will be using untagged corpora—that is, texts that no one has marked up for grammatical information. I would like to interfere with the data as little as possible.

What makes this project different from similar NLP projects is my aim. I want to produce a tool that can be used by literary critics. I am not interested in improving Siri or Alexa or a pop-up advertisement that wants to sell you shoes. Neither is my aim to propose hypotheses about natural languages, which is a general aim of linguistics-related NLPs. So, the object of my inquiry is artful writing, consciously patterned language.

STAGE ONE

The first stage is to write a standard NLP parser using tagged corpora. The standard parser will serve to check any results of the non-standard parser. Thanks to the generosity of Dr. Ondrej Tichý of Charles University in Prague, the standard parser is now equipped with a list of OE lexemes, parsed for form. A second control mechanism is the York-Helsinki Parsed Corpus of Old English, which is a tagged corpus of most of Aelfric’s Catholic sermons.

STAGE TWO

At the same time, I divided the OE corpus into genres. In poetic texts, dragons can breathe fire. But in non-fictional texts, dragons don’t exist. So a semantic field drawn around dragons will change depending on genre. I am subdividing the poetry according to codex, and then according to age (as far as is possible) to account for semantic shift. Those subdivisions will have to be revised, then abandoned as the AI engine gets running. (I’ll be using the python module fastai to implement an AI.)

Notes

Unicode. You’d think that searching a text string for another text string would be straightforward. But nothing is easy! A big part of preparing the Old English Corpus for manipulation is ensuring that the bits and bytes are in the right order. I had a great deal of difficulty opening the Bosworth-Toller structured data. It was in UTF-16, the character encoding commonly used on Windows machines. When I tried to open it via python, the interpreter threw an error. It turns out, Unicode is far more complex than one imagines. Although I can find ð and þ, for example, I cannot find them as the first characters of words after a newline (or even regex \b). Another hurdle.

Overcame it! The problem was in the data. For reasons unknown to me, Microsoft Windows encodes runic characters differently than expected. So the solution was to use a text editor (BB Edit), go into the data, and replace all original thorns with regular thorns. Same for eth, asc, and so forth. Weirdly, it didn’t look like I was doing anything: both thorns looked identical on the screen.
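In hindsight, the repair can probably be scripted, if my guess is right that the underlying problem was Unicode normalization: a base letter plus a combining mark renders identically to its precomposed equivalent, yet the two compare unequal. A sketch, with an illustrative filename:

import unicodedata

# Open the file with its actual encoding rather than the platform default.
with open('bt_data.txt', encoding='utf-16') as f:
    text = f.read()

# Canonical composition folds visually identical sequences into one form,
# so that searches for characters like ð and þ match wherever they occur.
text = unicodedata.normalize('NFC', text)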


Screen shot of parser guts so far. Markup is data from Tichy’s Bosworth-Toller. Inflect receives Markup, then returns inflections based on the gender of strong nouns. Variants and special cases have not yet been included.

To finish STAGE ONE, I’ll now inflect every noun, pronoun, and adjective, then conjugate every verb as they come in (on the fly). Ondrej Tichý at Charles University in Prague, who very generously sent me his structured data, took a slightly different approach: he generated all the permutations first and placed them into a list. Finally, as a sentence comes in, I’ll send off each word to the parser, receive its markup and inflections/conjugations, then search the markup for matches.

Square Roots

My daughter and I were recently playing with python’s square root function. She discovered that if you take the square root of a number made of an even number of ones, the result has half that number of threes on both sides of the decimal point. So √11 is approximately 3.3, √1111 is approximately 33.33, and so forth. We learned that this continues until there are eight threes on either side of the decimal point, then they reduce in frequency.

The square root of an odd number of ones is also patterned. √1 is 1, √111 is about 10.5, √11111 is about 105.4, √1111111 is about 1054.1, and so forth.

So we decided to write a python program to generate 20 instances. Here is the program:

#!/usr/bin/env python3
"""Determine the square roots of numbers comprised of ones: 11, 111, 1111, etc."""
import math

bobby = 10  # the next power of ten to tack on
sue = 1     # the running number made of ones (1, 11, 111, ...)
for x in range(1, 20):
    answer = bobby + sue  # append another 1
    sue = answer
    bobby = bobby * 10
    print(x + 1, '\tThe square root of ', answer, ' is ', math.sqrt(answer))

And here are the answers:

2 The square root of  11  is  3.3166247903554
3 The square root of  111  is  10.535653752852738
4 The square root of  1111  is  33.331666624997915
5 The square root of  11111  is  105.40872829135166
6 The square root of  111111  is  333.333166666625
7 The square root of  1111111  is  1054.0925006848308
8 The square root of  11111111  is  3333.3333166666666
9 The square root of  111111111  is  10540.925528624135
10 The square root of  1111111111  is  33333.33333166667
11 The square root of  11111111111  is  105409.25533841894
12 The square root of  111111111111  is  333333.33333316667
13 The square root of  1111111111111  is  1054092.553389407
14 The square root of  11111111111111  is  3333333.3333333167
15 The square root of  111111111111111  is  10540925.533894593
16 The square root of  1111111111111111  is  33333333.333333332
17 The square root of  11111111111111111  is  105409255.33894598
18 The square root of  111111111111111111  is  333333333.3333333
19 The square root of  1111111111111111111  is  1054092553.3894598
20 The square root of  11111111111111111111  is  3333333333.3333335

Although it looks like the sixes also multiply, they too reduce after reaching eight in a row. Check it out with python’s decimal package: from decimal import Decimal, then in the print statement, wrap math.sqrt(answer) in Decimal() to see every digit the float actually carries.
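Here is a quicker confirmation using decimal’s own square root, which is not limited to the roughly sixteen significant digits of a 64-bit float (the limit that cuts the pattern off at eight threes above):

from decimal import Decimal, getcontext

getcontext().prec = 50   # fifty significant digits
n = Decimal('1' * 20)    # a number made of twenty ones
print(n.sqrt())          # ten threes on each side of the decimal point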


Free Will

Wednesday 6 December: a very exciting discussion about free will put on by the Erasmus Center. Sincere thanks to Jim Holden and to Erasmus for inviting me to respond to Peter Tse, author of The Neural Basis of Free Will (MIT Press, 2013).

My main point during the debate was that standards of proof and acceptable methods of testing are not yet available to neuro-scientists to establish a physiological basis of free will. Study of the neuron is the province of bio-chemistry, which has its own standards of proof and acceptable methods of testing. These standards have been developed over decades, through argument and counter-argument, and through experimentation. They are not optional—not if you seek accurate results. Freedom is a concept discussed for centuries by philosophers, theologians, political scientists, and historians. Each of those fields has its own standards of proof and acceptable methods of argumentation. Those standards are important to ensuring logical results. Will or volition is chiefly the province of psychology, with its own standards of proof and acceptable methods of testing. So bringing bio-chemical evidence to a philosophical debate about a psychological topic seems to me to be like, as Laurie Anderson said, trying to dance architecture.

A secondary point I made was that any logical investigation proceeds from the question that you set. So, setting the question correctly is essential. We would not have had a debate had Dr. Tse written a book entitled, The Neural Basis of Unconstrained Choice. The phrase “free will” connotes something in English that the phrases “unconstrained choice” or “unfettered desire” do not. So, I tried to show how desire is different from will in English, how French and Latin are different again, and how investigating free will in English entails different logical assumptions than investigating it in French or Latin. In English, will connotes desire, want, action. In French, arbitre connotes sight, judgment, observation. Different semantic fields with little overlap. Another example: the greatest virtue according to Christians is love. That’s English. In the Latin Bible, the word is caritas. You can also translate caritas as charity (faith, hope, and charity). You can give charity without being in love, such as for tax purposes. So which one is the virtue? Faith in the Latin Bible is fides, which can also be translated as loyalty. Which is it? There’s a big difference between obeying someone that you don’t believe in and believing someone whom you don’t obey. Same for freedom. The French prize Liberté, or liberty. Would Dr. Tse have found the same things if he looked for liberty of desire? I don’t think so.

I also made the case that, as Gertrude Stein said of Oakland, “there’s no there there.” “Free will” is a concept that English speakers use to talk about a whole host of connected ideas and psychological processes. Free will is not a thing. It doesn’t exist the way Plymouth Rock or the Boston Marathon exist. Where do you find free will? I say, in a dictionary.

The public discussion among the guests afterwards was terrific. No one in the room doubted that the brain is essential to thinking. But there seemed a general consensus that thought is not reducible to bio-chemistry. Some people made the point that our morality and personal values depend upon a non-reductive view, on a non-physicalist view, of will. Others said that there are psychological responses that we think are free, but are actually conditioned or instinctive. So we have to distinguish the choices that are free from those that are not. Others asked whether or not free will introduces randomness into science, and if so, to what degree. (I tend to think that decisions are not made randomly, but on the basis of stochastic algorithms that measure optimality by accounting for values, external conditions, imagined results, and so forth.) What was most apparent to me is that neuro-science is not going to trump dozens of disciplines, centuries of carefully thought-out positions, and carefully considered, methodical experimentation. It reaffirmed my faith in the multiplicity of a university and in a fundamental need for diversity of viewpoints, all speaking with each other, each grounded in a distinct intellectual tradition.