The Best Pop Song Ever!

I found the top 50 words in all the #1 pop hits listed in Billboard magazine from 1965 to 2015.

Unfiltered (including prepositions, conjunctions, etc.): [('you', 8353), ('i', 7882), ('the', 7083), ('me', 4277), ('and', 4068), ('a', 3873), ('to', 3869), ('it', 3796), ('my', 3529), ('in', 2650), ('im', 2593), ('that', 2431), ('on', 2224), ('your', 2091), ('up', 2085), ('like', 2054), ('oh', 2013), ('all', 1759), ('we', 1726), ('dont', 1619), ('love', 1544), ('be', 1509), ('of', 1471), ('know', 1434), ('for', 1400), ('so', 1308), ('but', 1303), ('is', 1288), ('got', 1282), ('with', 1258), ('just', 1229), ('this', 1197), ('baby', 1113), ('when', 1073), ('get', 1059), ('its', 994), ('no', 989), ('now', 947), ('yeah', 944), ('what', 933), ('youre', 923), ('can', 916), ('go', 906), ('if', 885), ('do', 870), ('wanna', 858), ('down', 826), ('cause', 769), ('out', 767), ('make', 746)]

Filtered:

im, like, oh, dont, love, know, got, baby, get, yeah, youre, go, wanna, cause, make, want, girl, never, one, let, see, gonna, aint, cant, la, come, ill, back, time, feel

Then I put them together into (semi-)meaningful lyrics.

I'm like, oh don't love know

I got a baby, yeah

You're gonna wanna want me

Cause you make me, girl,

Never be the one to let me see

I ain't the one

Can't be —la la la—

the one to come back.

I'll take time to feel it.

Semantics in the Old English Poetic Line

The case of Maldon

During an Independent Study on the Battle of Maldon this week, we noticed that in lines 109 and 110, the weapons of war were named in the third lift. The verbs were in the fourth:

grimme gegrundene garas fleogan

bogan wæran bysige bord ord onfeng.

A lift is another term for one of the four heavily weighted syllables in the OE poetic line. Whether some have more semantic force than others is a question raised by Professor Smirnitskaya of Moscow State University. Her student, Dr. Ilya Sverdlov of the Helsinki Institute of Advanced Study, gave a terrific paper on lifts and semantic force here at UMass many years ago.

I wrote a program to extract the third lift from every line. Recall that every OE poetic line has four major stresses, a caesura between the second and third stress, and alliteration across the caesura.

The program is in Python. For each line of the poem:

  • remove OE stop-words
  • divide the line into half-lines (called the a-line and the b-line)
  • take the first letter of each word in the a-line in order to establish the pattern of alliteration for the b-line
  • return the third lift
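In outline, those steps might look like the sketch below. The stop-word list, the midpoint split for the caesura, and the alliteration test are all illustrative stand-ins, not the actual program.

```python
# Illustrative stand-ins only: a tiny OE stop-word list and a naive midpoint caesura.
OE_STOPWORDS = {'and', 'on', 'þa', 'se', 'to', 'in', 'wæs'}

def third_lift(line):
    """Return (third lift, alliterates?) for one OE poetic line."""
    words = [w for w in line.lower().split() if w not in OE_STOPWORDS]
    half = len(words) // 2
    a_line, b_line = words[:half], words[half:]
    # Initial letters of the a-line establish the alliterative pattern.
    pattern = {w[0] for w in a_line}
    lift = b_line[0]  # first stressed word after the caesura
    return lift, lift[0] in pattern

third_lift("grimme gegrundene garas fleogan")  # → ('garas', True)
```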

At the moment, the third lift is unformatted. But I’d like to format it in color if it’s an alliterated lift. That’s for later. Also for later is adding some functionality so that this program can retrieve any lift from any poem along with the part of speech of that lift (e.g. “garas”, noun plural). First, the results. Then, the code. NB. Some of the results are inaccurate—I’ve marked those with a Kleene star.

And now the code.

First, the list of stop-words in Old English:

If you would like the formatted text of Maldon, please write me.

Block China on UNIX

There are some very good scripts that modify iptables to block unwanted traffic before it gets anywhere near UMass servers. Some of the most aggressive hacking attempts come from China. Here is an automated hacking script trying to connect to my UMass box through ssh about every two seconds. Note the fake user names (viktor, tobyliu, Avignon-123, root). You can look up each originating server’s address with whois. Some of these attempts are proxied through DigitalOcean and other hosts; one comes from Ravna Gora, Croatia (195.29.105.125).

Oct 23 16:04:20 sshd: Received disconnect from 45.55.177.230 port 53758:11: Bye Bye 
Oct 23 16:04:20 sshd: Disconnected from invalid user viktor 45.55.177.230 port 53758 
Oct 23 16:05:32 sshd: refused connect from 218.92.0.204 (218.92.0.204)
Oct 23 16:05:34 sshd: Invalid user tobyliu from 129.158.73.119 port 23191
Oct 23 16:05:34 sshd: pam_unix(sshd:auth): check pass; user unknown
Oct 23 16:05:34 sshd: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=129.158.73.119
Oct 23 16:05:36 sshd: Failed password for invalid user tobyliu from 129.158.73.119 port 23191 ssh2
Oct 23 16:05:36 sshd: Received disconnect from 129.158.73.119 port 23191:11: Bye Bye 
Oct 23 16:05:36 sshd: Disconnected from invalid user tobyliu 129.158.73.119 port 23191 
Oct 23 16:05:44 sshd: Invalid user Avignon-123 from 1.203.115.64 port 54593
Oct 23 16:05:44 sshd: pam_unix(sshd:auth): check pass; user unknown
Oct 23 16:05:44 sshd: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=1.203.115.64
Oct 23 16:05:47 sshd: Failed password for invalid user Avignon-123 from 1.203.115.64 port 54593 ssh2
Oct 23 16:05:51 sshd: User root from 195.29.105.125 not allowed because not listed in AllowUsers
Oct 23 16:05:51 sshd: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=195.29.105.125 user=root
Oct 23 16:05:53 sshd: Failed password for invalid user root from 195.29.105.125 port 47984 ssh2
Oct 23 16:05:53 sshd: Received disconnect from 195.29.105.125 port 47984:11: Bye Bye 
Oct 23 16:05:53 sshd: Disconnected from invalid user root 195.29.105.125 port 47984 
Oct 23 16:06:15 sshd: refused connect from 218.92.0.204 (218.92.0.204)
Oct 23 16:06:52 sshd: refused connect from 218.92.0.204 (218.92.0.204)

Notice that some of the attempts were automatically refused. These came from addresses that I had already added to hosts.deny. To add this layer of protection, you can use a combination of python and bash scripting to add entries to the hosts.deny file. IMPORTANT: this only works on a single-user server, since it assumes that all connection attempts are phony. I run this python script when I am on my linux box, since I know I’m not trying to connect to it from off-site. You can make it part of your daily or hourly cron jobs. It takes the last 30 lines of auth.log, parses them for IP addresses, reformats them, then adds them to hosts.deny.

#! /usr/bin/env python3

import subprocess
import os
import re

# Grab the last 30 lines of the auth log.
result = subprocess.run(['tail', '-n', '30', '/var/log/auth.log'], stdout=subprocess.PIPE)
str_result = result.stdout.decode('utf-8')

# Match whole dotted-quad IP addresses anywhere in the log text.
# (A three-digit-prefix test would miss addresses like 45.55.177.230 or 1.203.115.64.)
ip_pattern = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")
chinaips = sorted(set(ip_pattern.findall(str_result)))
print(chinaips)

# Overwrite the working file with one hosts.deny entry per address.
with open("chinaips.txt", "w") as fh:
    fh.write(''.join("ALL:" + ip + "\n" for ip in chinaips))

# Append these ips to /etc/hosts.deny.
os.system('cat chinaips.txt | sudo tee -a /etc/hosts.deny')

Old English word shape

Crosswords are intriguing programming problems. How do you generate a New-York-Times-style crossword puzzle from a list of words? During an attempt (in Old English of course), I noticed some very interesting features of Old English words.

Consider the upper-left section of the crossword puzzle. An easy way to generate candidate words is to map the phonological shape of each slot before searching for instances of that shape in a word list. An easy shape is consonants (C) and vowels (V) alternating. If 1-across is C-V-C-V, then the next word down, 13-across, is V-C-V-C. Then you search the list of OE 4-letter words for that shape.
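A minimal sketch of that shape-mapping, assuming a simple vowel inventory (including æ and counting y as a vowel are my assumptions):

```python
VOWELS = set("aeiouyæ")  # assumed vowel inventory

def shape(word):
    """Map a word to its C/V pattern, e.g. 'bana' -> 'CVCV'."""
    return ''.join('V' if ch in VOWELS else 'C' for ch in word.lower())

words = ['bana', 'eþel', 'garas', 'bord']
cvcv = [w for w in words if shape(w) == 'CVCV']  # → ['bana']
```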

Say 1-across is C-V-C-V, bana ‘murderer’. 13-across might be V-C-V-C, eþel ‘homeland’. That sets up 1-down to start with be– and 2-down to start with æþ-. You’d think there would be plenty of words to fit that scheme.

But there are not! After extracting all words from the poetic corpus of Old English, I divided them into 3-, 4-, 5-, 6-, and 7-letter word lists. There are 133 tokens (words rather than lexemes) with the shape V-C-V-C:

abæd, abal, aban, aber, abit, acas, acol, acul, acyr, adam, adan, aðas, ades, adon, aðum, æcer, æðel, æfen, æfyn, ægen, æled, æleð, ænig, ænyg, æren, æres, æror, ærur, ætan, æten, ætes, æton, ætyw, æxum, afyr, agæf, agan, agar, agef, agen, agif, agof, agol, agon, agun, ahef, ahof, ahon, alæd, alæg, alæt, ales, alyf, alys, amæt, amen, amet, amor, anes, anum, arað, aras, ares, arim, aris, arod, arum, atol, aweg, awer, awoc, awox, axan, aþas, ecan, eces, ecum, eðan, eðel, edom, efen, enoc, enos, eror, etan, eteð, evan, eþel, ican, iceð, idel, ides, iren, isac, isen, isig, oðer, ofæt, ofen, ofer, ofet, ofir, ofor, onet, open, oreb, oroð, oruð, ower, oxan, oþer, ufan, ufon, ufor, upon, ures, urum, user, usic, utan, uten, uton, ycað, ycan, yced, yðum, yfel, ytum, ywan, ywaþ, ywed, yweð, yþum

And words with the shape C-V-C-V number 484:

bacu, baða, bæce, bæle, bære, bana, bare, baru, baþu, bega, bena, bene, bera, bere, bete, beþe, bide, bite, boca, boda, body, boga, bona, bote, bure, buta, bute, butu, byge, byme, byre, cafe, care, cele, cene, cepa, ciða, cile, come, cuðe, cuma, cume, cuþe, cyle, cyme, cymu, cyre, cyþe, dæda, dæde, dæge, dæle, dæne, ðæra, ðære, daga, dalu, ðane, ðara, dare, dege, dema, demæ, deme, dena, dene, ðere, ðine, dole, doma, dome, ðone, duna, dune, dura, dure, duru, dyde, ðyle, dyne, dyre, faca, fæce, fæge, fæla, fæle, fære, fana, fane, fara, fare, feða, feðe, fela, fele, fere, feþa, feþe, fife, fira, fire, five, fore, fota, fote, fula, fule, fuse, fyra, fyre, gara, gare, gatu, gedo, gena, geno, gere, geta, gife, gifu, gina, goda, gode, godu, gota, guðe, guma, gume, gute, guþe, gyfe, gyme, gyta, gyte, hada, hade, hæle, hælo, hælu, hæse, hæto, hafa, hafo, hafu, hale, hali, hama, hame, hara, hare, hata, hate, hefe, hege, hele, helo, here, hete, hewe, hige, hina, hine, hira, hire, hiwa, hiwe, hofe, hofu, hole, hopa, hope, horu, huðe, huga, huna, huru, husa, huse, huþa, huþe, hyde, hyðe, hyge, hyne, hyra, hyre, hyse, lace, laða, lade, laðe, læce, læde, læla, læne, lænu, lære, læte, lafe, lage, lago, lagu, lama, lame, lara, lare, lata, late, latu, laþe, lefe, lega, lege, lene, lete, lica, lice, lida, liða, lide, liðe, life, lige, lima, lime, liþe, liþu, locu, lofe, lufa, lufæ, lufe, lufu, lyfe, lyge, lyre, mæca, mæða, mæga, mæge, mæla, mæle, mæne, mæra, mære, mæro, mæru, mæte, maga, mage, mago, magu, mana, mane, mara, mare, meca, mece, meda, mede, meðe, medo, medu, melo, mere, mete, meþe, mide, mila, mine, moda, mode, modi, mona, more, mose, mote, muðe, muþa, muþe, myne, naca, næle, næni, nære, næte, nama, name, nane, neda, nede, nefa, nele, niða, niðe, nime, nine, niwe, niþa, niþe, noma, nose, noþe, nyde, nyle, race, racu, rade, raðe, ræda, ræde, raþe, rece, reða, reðe, rene, reþe, rica, rice, ricu, ride, rime, ripa, ripe, rode, rofe, rome, rope, rowe, rume, runa, rune, ryha, 
ryne, sace, sacu, sade, sæce, sæda, sæde, sæge, sæla, sæle, sæne, saga, sale, salo, salu, same, sara, saræ, sare, sari, sece, seðe, sefa, sege, sele, seme, sene, sete, sida, siða, side, siðe, sido, sige, sile, sina, sine, site, siþa, siþe, soða, soðe, some, sona, sone, soþa, soþe, sume, suna, suno, sunu, syle, syne, synu, sype, syre, syþe, tæle, tæso, tala, tale, tame, tane, tela, tene, tida, tiða, tide, tiðe, tila, tile, tima, tire, toða, tome, toþe, tuge, tyne, waca, wace, wada, wade, waðe, wado, wadu, waðu, wæda, wæde, wædo, wædu, wæge, wæle, wæra, wære, wæta, wage, wala, wale, walo, walu, wana, ware, waru, wega, wege, wela, wena, wene, wepe, wera, were, wese, wica, wida, wide, widu, wifa, wife, wiga, wige, wile, wina, wine, wire, wisa, wise, wita, wite, witu, woða, woma, wope, wora, woþa, wuda, wudu, wule, wuna, wyle, þæce, þæne, þæra, þære, þane, þara, þare, þine, þire, þone, þyle, þyre

Notice how many end with dative singular markers like –e. It suggests that we are looking at inflected forms of a C-V-C shape, one of the most common Proto-Indo-European root forms. Again in the poetic corpus, there are 350 C-V-C tokens:

bad, bæc, bæd, bæð, bæg, bæl, bæm, bær, bam, ban, bat, bec, bed, beg, ben, bet, bid, bið, bil, bit, biþ, boc, boð, boh, bot, bur, byð, byþ, cam, can, cen, cer, ces, cið, col, com, con, cuð, cum, cuþ, cyð, cym, cyn, dæd, dæg, dæl, ðæm, ðær, ðæs, ðæt, ðah, ðam, ðan, ðar, ðas, day, ðec, deð, ðeð, ðeh, dem, ðem, ðer, ðes, ðet, deþ, dim, ðin, ðis, doð, dol, dom, don, ðon, dor, doþ, dun, ðus, dyn, ðyn, ðys, fæc, fær, fæt, fag, fah, fam, fan, far, fed, fel, fen, fet, fex, fif, fin, foh, fon, for, fot, ful, fus, fyl, fyr, fys, gad, gað, gæd, gæð, gæþ, gal, gan, gar, ged, gem, gen, get, gid, gif, gim, gin, git, god, guð, gyd, gyf, gyt, had, hæl, hær, hæs, hal, ham, har, hat, heg, heh, hel, her, het, hig, him, his, hit, hiw, hof, hoh, hol, hun, hus, hyd, hyð, hys, hyt, lac, lad, lað, læd, læf, læg, læn, lær, læs, læt, laf, lah, lar, laþ, lef, leg, len, let, lic, lid, lið, lif, lig, lim, lit, liþ, loc, lof, log, lot, lyt, mað, mæg, mæl, mæn, mæt, mæw, man, mec, men, mid, mið, min, mit, mod, mon, mor, mos, mot, muð, muþ, næs, nah, nam, nan, nap, nas, nat, neb, ned, neh, nes, nið, nim, nis, niþ, nom, non, num, nyd, nys, nyt, pyt, rad, ræd, ran, rec, ren, rex, rib, rim, rod, rof, rot, rum, run, ryn, sæd, sæl, sæm, sæp, sæs, sæt, sag, sah, sal, sar, sec, sel, sem, sib, sic, sid, sið, sin, sit, siþ, soð, sol, soþ, suð, sum, syb, syn, syx, syþ, tan, teð, teþ, tid, til, tin, tir, tor, tun, tyd, tyn, tyr, wac, wæf, wæg, wæl, wæn, wær, wæs, wæt, wag, wah, wan, was, wat, web, wed, weg, wel, wen, wep, wer, wes, wet, wic, wid, wið, wif, wig, win, wir, wis, wit, wiþ, woc, wod, woð, woh, wol, wom, won, wop, wyð, wyl, wyn, wyt, zeb, þæh, þæm, þær, þæs, þæt, þam, þan, þar, þas, þat, þec, þeh, þem, þer, þes, þet, þin, þis, þon, þus, þyð, þyn, þys

But the alternative shape, V-C-V, has far fewer instances. I count 51:

ace, ada, aða, ade, aðe, ado, æce, æna, æne, æni, æra, æse, æte, æwæ, aga, age, ana, ane, ara, are, awa, awo, eca, ece, eci, eðe, ege, ele, ely, esa, ete, eþe, iða, ige, ipe, oga, ore, oxa, uðe, una, ura, ure, uta, ute, utu, uþe, yða, yðe, yne, yþa, yþe

269 of the CVC forms overlap with the CVCV forms, suggesting that

  1. CVCV forms that overlap with CVC forms are inflections of the root, or
  2. they are coincidentally similar and represent two different lexemes

As I continue to refine my OE Parser, I wonder whether employing PIE root forms might be useful in identifying lexemes. Certainly, when I turn to programming James E. Cathey’s tremendous diachronic phonology of Germanic languages, root form/shape will play an essential role. One of the methods I wrote in python checks for root form/shape, and I hoped to use it to identify spelling variants—allowing variation only in root vowels of a form. So: C-V(1)-C, C-V(2)-C, … C-V(n)-C.
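That root-vowel check might be sketched as follows; the vowel set is an illustrative assumption, not the parser’s actual inventory:

```python
VOWELS = set("aeiouyæ")  # assumed vowel inventory

def same_root_skeleton(a, b):
    """True if two forms share their consonant skeleton and differ only in
    vowel slots, e.g. C-V(1)-C vs C-V(2)-C -- a candidate spelling variant."""
    if len(a) != len(b):
        return False
    return all(x == y or (x in VOWELS and y in VOWELS) for x, y in zip(a, b))

same_root_skeleton('sel', 'sal')  # → True: same C_C skeleton, different root vowel
```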

Back to the crossword!

Update on a Parser-Tagger of Old English

Ottawa, Ontario
5 April 2019

Screen shot of Tagger

Method

Over the last eight months my approach to tagging an untagged sentence of Old English has been three-fold.
  1. First, I perform a simple look-up using four dictionaries (Bosworth-Toller, Clark-Hall, Bright’s glossary, a glossary of OE poetry from Texas), then save the results.
  2. Second and independently, I run the search-term through an inflectional search, returning the most likely part of speech based on suffixes and prefixes, then generate a weight based on whether or not that POS is closed-class or open-class. Those results are also saved.
  3. Third and finally, I check the search-term against a list of lemmata that I compiled by combining Dictionary of Old English lemmata with Bosworth-Toller lemmata. If the lemma is not found in the list, then I send the search-term to an inflectional search, take the returned inflectional category and generate all possible forms, then search the list for one of those forms; if a form matches an existing lemma, then I break; if not, I repeat with the next most likely part of speech. Those results are also saved.

After these three steps run independently on the target sentence, I compare all three sets of saved results and weigh them accordingly. No possibilities are omitted until syntactic information can be adduced.
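The comparison step can be sketched as a merge of three weight tables. The {part-of-speech: weight} dictionaries are my assumption about the shape of the saved results, not the program’s actual data structures:

```python
from collections import defaultdict

def combine(dictionary_hits, inflection_hits, lemma_hits):
    """Merge three independent {part-of-speech: weight} result sets;
    no possibility is dropped until syntax can weigh in."""
    weights = defaultdict(float)
    for result in (dictionary_hits, inflection_hits, lemma_hits):
        for pos, w in result.items():
            weights[pos] += w
    # Return with the highest-weight part of speech first.
    return dict(sorted(weights.items(), key=lambda kv: -kv[1]))

combine({'noun': 0.6}, {'noun': 0.3, 'verb': 0.2}, {'noun': 0.5})
```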

(Although I haven’t written it up, a search of the Helsinki Corpus might be useful as a fourth approach: if the term is parsed in the YCOE, that information could add to the weight of likelihood.)

Taking three approaches and comparing three sets of results is about 85% accurate.

Syntax

In order to improve the weights on the guesses, I’m writing a python class to guess syntactic patterns. I would like the class to examine the words without any information on their inflections or roots. The percentages here are not very good, but if you accumulate guesses, then accuracy improves (solving for theta). So, I look at
  1. the position of the term in a sentence. The percentages here are barely useful. If the term is in the first half of a prose sentence, then it is more likely than not (51%) to be something other than a verb or adverb. If the term is in the second half of the sentence, then it is more likely than not to be a verb or adverb. These percentages are discovered by parsing all sentences in the Corpus except those that derive from OE glosses on Latin—where underlying Latin word-order corrupts the data.
  2. its relative position with respect to closed-class words. These percentages are a little more useful.  For example, if the term follows a determiner, then it is more likely to be a noun or adjective than to be a verb.
  3. whether or not it is in a prepositional phrase and if so where. The word immediately following the preposition is likely either a noun or an adjective.
  4. whether or not it alliterates with other words in the sentence (OE alliteration tends to prioritize nouns, adjectives, and verbs).
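A toy version of the first two heuristics might look like this; the 51/49 split reflects the percentages cited above, while the determiner set and the return format are illustrative:

```python
DETERMINERS = {'se', 'seo', 'þæt', 'þa', 'þam', 'þone'}  # illustrative sample

def position_guess(index, n_tokens):
    """Prose heuristic: the first half of a sentence slightly favors
    something other than a verb or adverb (51%)."""
    if index < n_tokens / 2:
        return {'verb_or_adverb': 0.49, 'other': 0.51}
    return {'verb_or_adverb': 0.51, 'other': 0.49}

def follows_determiner(tokens, index):
    """A word right after a determiner leans noun or adjective."""
    return index > 0 and tokens[index - 1] in DETERMINERS
```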

The point of this python class is to come to a judgment about the part of speech of a term without looking it up in a wordlist. So far, a class meant to identify prepositional phrases works fairly well—I still need to deal with compound objects.

Screen shot of tagger with Prepositional Phrases

 

You’ll notice in the screenshot above that the tagger returns prepositional phrases. If you know python, you can see that highly likely tags are returned as strings and that less likely tags are returned in a list. This distinction in data types allows me to anticipate the syntax parser with a type() query. If type() == list, then ignore. You’ll notice that it has mischaracterized the last word, gereste, as a noun or adjective. It is a verb.

Next ?

The last step is to merge the two sets of weights together and select the most likely part of speech for a word. Since the result is so data-rich, it allows a user to search for syntactic patterns as well as for words, bigrams, trigrams, formulae, and so forth.

So, a user could search for all adjectives that describe cyning ‘king’ or cwen ‘queen’. Or find all adjectives that describe both. Or all verbs used of queens. Or how many prepositional phrases mention clouds.

[29 March 2023] The beta-parser is available on github as “oenouns”: https://github.com/sharris-umass

Bigrams

9 March 2019. Puzzling out a word jumble, I’m writing a python script to search a grid for words. Step one is to compile a list of legal bigrams in English. Bigrams are two letters that go side-by-side. So the letter <Q> in English has a limited list of bigrams. We see <QU> as in quit, <QA> as in Qatar (and a few others if you allow very rare words).

I found a huge list online of English words compiled from web pages. 2.5 megs of text file! Here is the resulting python dict of bigrams:

{'A':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
'B':['A', 'B', 'C', 'D', 'E', 'F', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'G', 'P', 'Z', 'Q'],
'C':['A', 'I', 'K', 'T', 'U', 'E', 'O', 'Y', 'H', 'C', 'L', 'M', 'N', 'Q', 'R', 'S', 'D', 'B', 'W', 'Z', 'G', 'P', 'F'],
'D':['V', 'W', 'E', 'I', 'O', 'L', 'N', 'A', 'U', 'G', 'Y', 'R', 'P', 'C', 'D', 'F', 'H', 'J', 'M', 'S', 'T', 'Z', 'B', 'K', 'Q'],
'E':['H', 'R', 'D', 'N', 'E', 'S', 'M', 'Y', 'V', 'L', 'A', 'C', 'I', 'P', 'T', 'K', 'Z', 'U', 'G', 'W', 'B', 'F', 'O', 'X', 'Q', 'J'],
'F':['F', 'T', 'A', 'U', 'O', 'E', 'I', 'Y', 'L', 'G', 'R', 'S', 'W', 'Z', 'N', 'V', 'H', 'B', 'K', 'D', 'M', 'J', 'P', 'C'],
'G':['I', 'E', 'H', 'L', 'N', 'A', 'Y', 'O', 'R', 'M', 'U', 'S', 'D', 'G', 'K', 'P', 'B', 'W', 'T', 'F', 'C', 'V', 'J', 'Z'],
'H':['R', 'E', 'L', 'M', 'I', 'Y', 'O', 'U', 'A', 'T', 'N', 'S', 'W', 'B', 'P', 'Z', 'G', 'C', 'F', 'D', 'H', 'J', 'K', 'V', 'Q'],
'I':['C', 'T', 'N', 'S', 'O', 'E', 'A', 'Z', 'R', 'L', 'D', 'U', 'P', 'G', 'B', 'V', 'F', 'M', 'I', 'X', 'K', 'Y', 'W', 'H', 'Q', 'J'],
'J':['E', 'O', 'U', 'A', 'I', 'H', 'J', 'R', 'Y', 'P', 'D', 'M', 'W', 'L', 'T', 'N', 'B', 'K'],
'K':['A', 'H', 'E', 'I', 'Z', 'M', 'N', 'B', 'S', 'L', 'O', 'C', 'K', 'P', 'R', 'T', 'U', 'W', 'Y', 'D', 'F', 'G', 'J', 'V'],
'L':['F', 'L', 'U', 'I', 'O', 'E', 'Y', 'A', 'M', 'T', 'S', 'N', 'V', 'C', 'D', 'B', 'G', 'H', 'P', 'R', 'K', 'W', 'J', 'Q', 'Z', 'X'],
'M':['A', 'P', 'E', 'B', 'I', 'O', 'H', 'U', 'Y', 'M', 'S', 'T', 'F', 'L', 'W', 'N', 'R', 'C', 'G', 'V', 'K', 'D', 'J', 'Z', 'Q'],
'N':['I', 'A', 'C', 'E', 'D', 'T', 'U', 'O', 'S', 'R', 'G', 'Y', 'M', 'N', 'Z', 'L', 'P', 'K', 'F', 'H', 'Q', 'B', 'J', 'V', 'X', 'W', '-'],
'O':['L', 'N', 'R', 'S', 'I', 'M', 'T', 'U', 'G', 'O', 'W', 'A', 'B', 'D', 'H', 'V', 'X', 'C', 'K', 'Z', 'P', 'Y', 'E', 'F', 'Q', 'J'],
'P':['E', 'T', 'O', 'Y', 'I', 'H', 'S', 'R', 'A', 'N', 'U', 'L', 'P', 'M', 'J', 'B', 'D', 'F', 'W', 'K', 'C', 'G', 'V', 'Q'],
'Q':['U', 'I', 'A', 'R', 'E', 'O', 'Q'],
'R':['D', 'O', 'U', 'E', 'A', 'I', 'T', 'Y', 'R', 'S', 'V', 'M', 'B', 'P', 'G', 'N', 'H', 'L', 'F', 'W', 'C', 'K', 'J', 'Q', 'X', 'Z'],
'S':['C', 'T', 'A', 'E', 'S', 'I', 'G', 'H', 'K', 'O', 'M', 'U', 'F', 'Q', 'V', 'Y', 'P', 'L', 'N', 'B', 'W', 'D', 'R', 'J', 'Z'],
'T':['E', 'I', 'O', 'H', 'A', 'T', 'U', 'C', 'N', 'S', 'R', 'M', 'L', 'Y', 'B', 'P', 'F', 'W', 'K', 'Z', 'D', 'G', 'J', 'V', 'Q', 'X'],
'U':['A', 'S', 'L', 'R', 'C', 'M', 'N', 'D', 'T', 'E', 'V', 'P', 'Z', 'B', 'I', 'O', 'X', 'G', 'K', 'F', 'Y', 'W', 'J', 'H', 'Q', 'U'],
'V':['A', 'E', 'I', 'O', 'U', 'Y', 'S', 'R', 'C', 'L', 'V', 'N', 'Z', 'D', 'K', 'G'],
'W':['O', 'H', 'A', 'E', 'I', 'L', 'N', 'S', 'T', 'R', 'M', 'U', 'Y', 'B', 'P', 'W', 'D', 'F', 'K', 'C', 'G', 'Z', 'Q', 'V', 'J'],
'X':['I', 'A', 'Y', 'T', 'E', 'O', 'U', 'M', 'P', 'C', 'B', 'F', 'H', 'L', 'S', 'W', 'R', 'D', 'K', 'N', 'G', 'Q', 'Z', 'V'],
'Y':['S', 'M', 'A', 'R', 'C', 'P', 'G', 'I', 'L', 'N', 'D', 'T', 'X', 'O', 'E', 'Z', 'U', 'F', 'W', 'H', 'B', 'Y', 'K', 'V', 'J', 'Q'],
'Z':['E', 'A', 'U', 'Z', 'I', 'O', 'L', 'G', 'Y', 'R', 'H', 'T', 'N', 'B', 'D', 'P', 'K', 'C', 'M', 'V', 'S', 'F', 'W']
}

And here is the code to get the bigrams (my file of words is called web2.txt, and each word is on a separate line). In order to limit the bigrams to a list of unique letters, I use set().

import os

path = os.path.join(os.getcwd(), 'web2.txt')

# One set of unique following letters per initial letter.
bigrams = {letter: set() for letter in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'}

with open(path, 'r') as allwords:
    words = allwords.read().split('\n')

for word in words:
    word = word.upper()
    # Walk every adjacent pair of letters, not just the pair at the
    # first occurrence of each letter.
    for first, second in zip(word, word[1:]):
        if first in bigrams:
            bigrams[first].add(second)

for letter in bigrams:
    print('\'{0}\':{1}, '.format(letter, sorted(bigrams[letter])))

Bigrams

March 6. An interim step in making a semantic map of Old English is producing bigrams. Bigrams are pairs of words. In order to build a social network of words, you need to know which words connect to one another. For example, in Beowulf, the word wolcnum ‘clouds’ almost always sits next to under ‘under’.

By comparison, the epic poem Judith has no clouds in it. And the homilist Ælfric never uses the phrase under wolcnum.
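Counting the words that follow a given headword takes only a few lines of python. This is a generic sketch with toy data, not the actual script behind the screenshot:

```python
from collections import Counter

def bigrams_after(tokens, target):
    """Count the words that immediately follow `target` in a token list."""
    return Counter(b for a, b in zip(tokens, tokens[1:]) if a == target)

tokens = 'ic nah ic wat under wolcnum ic nah'.split()  # toy data
bigrams_after(tokens, 'ic')  # → Counter({'nah': 2, 'wat': 1})
```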

Here is a screen shot of words that follow ic ‘I’ in the poem Beowulf. So, the first is “ic nah.”

You can see that there are 181 instances of ic, although only 80 are unique. In other words, some bigrams are repeated. The second word of the bigram is printed again in red, and passed to a part-of-speech tagger. The blue text is the tagger’s best guess, and it also returns the part-of-speech most cited by dictionaries. As I plan to discuss in an article, ic is very rarely followed by a verb.

We can discover a great deal about poetic style by looking very closely at the grammar of Old English poetry. The grammar is the unfolding in time of images and ideas and asides and so forth. Grammar describes how the words affect you in order as you read.

About three-quarters there

Screen shot 12/2/2018. You’re only as good as your data

That is the lesson here. Single brackets [x] indicate an entry in Ondrej Tichy’s Bosworth-Toller, which I edited into a json file. Double brackets [[x]] indicate an entry in the raw data of Ondrej’s BT, consulted if the word wasn’t found in the json file. Empty brackets indicate no returned value. A word like mæg can mean ‘may’ (V) or ‘kin’ (N). The word didn’t make it into the structured data, and the raw data mischaracterized its verbal form, so the parser didn’t pick up the verb.

Rather than spend days improving the data from Bosworth-Toller, or overwhelm the servers in Prague with BeautifulSoup requests, I’m going to scrape word lists from Old English sites and OCR some glossaries from freely-available books. If I can compile 10 or 20 word lists and zip them to grammatical information, I can get a percentage of likelihood for any given word. Second, I can use the York-Helsinki Parsed Corpus of Ælfric’s prose through CLTK. It won’t catch all of the words, but it might help.

I’ve written a simple script to inflect any noun or adjective and to conjugate any verb. I can work it backwards to find the root form of a word, then send that to BT.

Final step is to run the words and forms through a syntactic parser. If it sees ne, which carries a weight of 5, then it increases the likelihood that the next word is a verb, since negative particles almost always sit next to verbs in OE. (One can check that with a bigram search.) Similar proximity searches to prepositions, pronouns, and so forth help to assess weights (probabilities).
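That ne-weighting step might look something like this; the weight of 5 comes from above, but the per-word weight dictionaries are my assumption about the data structure:

```python
def weight_after_ne(tokens, weights, bump=5):
    """Raise the verb weight of any word that immediately follows
    the negative particle ne."""
    for i, tok in enumerate(tokens[:-1]):
        if tok == 'ne':
            weights[i + 1]['verb'] = weights[i + 1].get('verb', 0) + bump
    return weights

weight_after_ne(['ic', 'ne', 'wat'], [{}, {}, {}])  # → [{}, {}, {'verb': 5}]
```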

Once this next layer is completed, and the weights adjusted, I will have a decent control to check the more experimental parser.

Poetic Words

sort | uniq

Has anyone done this since Angus Cameron suggested it in 1973? I separated the Corpus of Old English into genres and sub-genres, which enabled me to find words unique to poetry. The poetic texts are largely from the ASPR, but include the Chronicle poems, the Meters of Boethius, and others.

First, I sorted the words into alphabetical order and removed duplicates. Second, I did the same for all prose texts. I also removed all foreign words from the prose texts—those are words that the Dictionary of Old English designated as foreign by placing them within <foreign> tags. Third, I compared prose words with poetic words. The resulting list is a set of all words used only in the poetic texts. Here is the file (right-click to download): PoeticWords
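In miniature, the comparison is a set difference. The word lists here are toy samples standing in for the full corpus files:

```python
poetry = {'guðrinc', 'cyning', 'swanrad'}   # words from the poetic texts (toy sample)
prose = {'cyning', 'and', 'se', 'wæs'}      # prose words, foreign words already removed
poetic_only = sorted(poetry - prose)        # → ['guðrinc', 'swanrad']
```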

The next step is to classify each word by word class. That will allow me to differentiate verbal phrases from noun phrases in the poetry. Once noun phrases are isolated, I can begin to build a semantic map of poetic discourse in Old English. Afterwards, I’ll add verb phrases. So we’ll be able to know how OE poets described queens (adjectives) and what sort of acts queens performed (verbs), and compare that to descriptions of kings and the acts they performed. We can then further differentiate dryhten from cyning, and cwen from ides. But there’s a big caveat.

Because Old English poets wrote alliterative verse, adjectives and verbs may have been chosen simply on account of their initial sound. So, cwen may have attracted /k/-initial words. That is why it is essential to also build a map in prose of cwen. Since the formal structure of prose was not governed by alliteration (with the possible exception of Ælfric), the map in prose and the map in poetry of any given noun might well be distinct.