Old English word shape

Posted on May. 3.2019 by Stephen Harris

Crosswords are intriguing programming problems. How do you generate a New-York-Times-style crossword puzzle from a list of words? During an attempt (in Old English of course), I noticed some very interesting features of Old English words.

Consider the upper-left section of the crossword puzzle. An easy solution to generating words is to map the phonological shape of words before searching for instances of that shape in a list. So, an easy shape is consonants (C) and vowels (V) alternating. If 1-across is C-V-C-V, then the next word down, 13-across, is V-C-V-C. Then, you search the list of OE 4-letter words for that shape.
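A minimal sketch of the shape search described above, not the code used for this post. It assumes the vowel letters are a, æ, e, i, o, u, and y and treats every other character as a consonant; the names `VOWELS`, `shape`, and `words_with_shape` are made up for illustration.

```python
# Assumed vowel inventory; the post does not say which letters count as vowels.
VOWELS = set("aæeiouy")


def shape(word):
    """Return the consonant/vowel skeleton of a word, e.g. 'bana' -> 'CVCV'."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower())


def words_with_shape(words, pattern):
    """Keep only the words whose skeleton matches the requested pattern."""
    return [w for w in words if shape(w) == pattern]


# A handful of forms taken from the word lists in this post.
sample = ["bana", "eþel", "wudu", "æcer", "dæg", "ece"]

print(words_with_shape(sample, "CVCV"))  # ['bana', 'wudu']
print(words_with_shape(sample, "VCVC"))  # ['eþel', 'æcer']
```

Because anything outside the vowel set is labelled C, punctuation or digits in a word list would be counted as consonants, so a real list would need cleaning first.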

Say 1-across is C-V-C-V, bana ‘murderer’. 13-across might be V-C-V-C, eþel ‘homeland’. That sets up 1-down to start with b–e– and 2-down to start with a–þ–. You’d think there would be plenty of words to fit that scheme.

But there are not! After extracting all words from the poetic corpus of Old English, I divided them into 3-, 4-, 5-, 6-, and 7-letter word lists. There are 133 tokens (words rather than lexemes) with the shape V-C-V-C:

abæd, abal, aban, aber, abit, acas, acol, acul, acyr, adam, adan, aðas, ades, adon, aðum, æcer, æðel, æfen, æfyn, ægen, æled, æleð, ænig, ænyg, æren, æres, æror, ærur, ætan, æten, ætes, æton, ætyw, æxum, afyr, agæf, agan, agar, agef, agen, agif, agof, agol, agon, agun, ahef, ahof, ahon, alæd, alæg, alæt, ales, alyf, alys, amæt, amen, amet, amor, anes, anum, arað, aras, ares, arim, aris, arod, arum, atol, aweg, awer, awoc, awox, axan, aþas, ecan, eces, ecum, eðan, eðel, edom, efen, enoc, enos, eror, etan, eteð, evan, eþel, ican, iceð, idel, ides, iren, isac, isen, isig, oðer, ofæt, ofen, ofer, ofet, ofir, ofor, onet, open, oreb, oroð, oruð, ower, oxan, oþer, ufan, ufon, ufor, upon, ures, urum, user, usic, utan, uten, uton, ycað, ycan, yced, yðum, yfel, ytum, ywan, ywaþ, ywed, yweð, yþum

And words with the shape C-V-C-V number 484:

bacu, baða, bæce, bæle, bære, bana, bare, baru, baþu, bega, bena, bene, bera, bere, bete, beþe, bide, bite, boca, boda, body, boga, bona, bote, bure, buta, bute, butu, byge, byme, byre, cafe, care, cele, cene, cepa, ciða, cile, come, cuðe, cuma, cume, cuþe, cyle, cyme, cymu, cyre, cyþe, dæda, dæde, dæge, dæle, dæne, ðæra, ðære, daga, dalu, ðane, ðara, dare, dege, dema, demæ, deme, dena, dene, ðere, ðine, dole, doma, dome, ðone, duna, dune, dura, dure, duru, dyde, ðyle, dyne, dyre, faca, fæce, fæge, fæla, fæle, fære, fana, fane, fara, fare, feða, feðe, fela, fele, fere, feþa, feþe, fife, fira, fire, five, fore, fota, fote, fula, fule, fuse, fyra, fyre, gara, gare, gatu, gedo, gena, geno, gere, geta, gife, gifu, gina, goda, gode, godu, gota, guðe, guma, gume, gute, guþe, gyfe, gyme, gyta, gyte, hada, hade, hæle, hælo, hælu, hæse, hæto, hafa, hafo, hafu, hale, hali, hama, hame, hara, hare, hata, hate, hefe, hege, hele, helo, here, hete, hewe, hige, hina, hine, hira, hire, hiwa, hiwe, hofe, hofu, hole, hopa, hope, horu, huðe, huga, huna, huru, husa, huse, huþa, huþe, hyde, hyðe, hyge, hyne, hyra, hyre, hyse, lace, laða, lade, laðe, læce, læde, læla, læne, lænu, lære, læte, lafe, lage, lago, lagu, lama, lame, lara, lare, lata, late, latu, laþe, lefe, lega, lege, lene, lete, lica, lice, lida, liða, lide, liðe, life, lige, lima, lime, liþe, liþu, locu, lofe, lufa, lufæ, lufe, lufu, lyfe, lyge, lyre, mæca, mæða, mæga, mæge, mæla, mæle, mæne, mæra, mære, mæro, mæru, mæte, maga, mage, mago, magu, mana, mane, mara, mare, meca, mece, meda, mede, meðe, medo, medu, melo, mere, mete, meþe, mide, mila, mine, moda, mode, modi, mona, more, mose, mote, muðe, muþa, muþe, myne, naca, næle, næni, nære, næte, nama, name, nane, neda, nede, nefa, nele, niða, niðe, nime, nine, niwe, niþa, niþe, noma, nose, noþe, nyde, nyle, race, racu, rade, raðe, ræda, ræde, raþe, rece, reða, reðe, rene, reþe, rica, rice, ricu, ride, rime, ripa, ripe, rode, rofe, rome, rope, rowe, rume, runa, rune, ryha, 
ryne, sace, sacu, sade, sæce, sæda, sæde, sæge, sæla, sæle, sæne, saga, sale, salo, salu, same, sara, saræ, sare, sari, sece, seðe, sefa, sege, sele, seme, sene, sete, sida, siða, side, siðe, sido, sige, sile, sina, sine, site, siþa, siþe, soða, soðe, some, sona, sone, soþa, soþe, sume, suna, suno, sunu, syle, syne, synu, sype, syre, syþe, tæle, tæso, tala, tale, tame, tane, tela, tene, tida, tiða, tide, tiðe, tila, tile, tima, tire, toða, tome, toþe, tuge, tyne, waca, wace, wada, wade, waðe, wado, wadu, waðu, wæda, wæde, wædo, wædu, wæge, wæle, wæra, wære, wæta, wage, wala, wale, walo, walu, wana, ware, waru, wega, wege, wela, wena, wene, wepe, wera, were, wese, wica, wida, wide, widu, wifa, wife, wiga, wige, wile, wina, wine, wire, wisa, wise, wita, wite, witu, woða, woma, wope, wora, woþa, wuda, wudu, wule, wuna, wyle, þæce, þæne, þæra, þære, þane, þara, þare, þine, þire, þone, þyle, þyre

Notice how many end with dative singular markers like –e. It suggests that we are looking at inflected forms of a C-V-C shape, one of the most common Proto-Indo-European root forms. Again, in the poetic corpus, there are 350 such tokens:

bad, bæc, bæd, bæð, bæg, bæl, bæm, bær, bam, ban, bat, bec, bed, beg, ben, bet, bid, bið, bil, bit, biþ, boc, boð, boh, bot, bur, byð, byþ, cam, can, cen, cer, ces, cið, col, com, con, cuð, cum, cuþ, cyð, cym, cyn, dæd, dæg, dæl, ðæm, ðær, ðæs, ðæt, ðah, ðam, ðan, ðar, ðas, day, ðec, deð, ðeð, ðeh, dem, ðem, ðer, ðes, ðet, deþ, dim, ðin, ðis, doð, dol, dom, don, ðon, dor, doþ, dun, ðus, dyn, ðyn, ðys, fæc, fær, fæt, fag, fah, fam, fan, far, fed, fel, fen, fet, fex, fif, fin, foh, fon, for, fot, ful, fus, fyl, fyr, fys, gad, gað, gæd, gæð, gæþ, gal, gan, gar, ged, gem, gen, get, gid, gif, gim, gin, git, god, guð, gyd, gyf, gyt, had, hæl, hær, hæs, hal, ham, har, hat, heg, heh, hel, her, het, hig, him, his, hit, hiw, hof, hoh, hol, hun, hus, hyd, hyð, hys, hyt, lac, lad, lað, læd, læf, læg, læn, lær, læs, læt, laf, lah, lar, laþ, lef, leg, len, let, lic, lid, lið, lif, lig, lim, lit, liþ, loc, lof, log, lot, lyt, mað, mæg, mæl, mæn, mæt, mæw, man, mec, men, mid, mið, min, mit, mod, mon, mor, mos, mot, muð, muþ, næs, nah, nam, nan, nap, nas, nat, neb, ned, neh, nes, nið, nim, nis, niþ, nom, non, num, nyd, nys, nyt, pyt, rad, ræd, ran, rec, ren, rex, rib, rim, rod, rof, rot, rum, run, ryn, sæd, sæl, sæm, sæp, sæs, sæt, sag, sah, sal, sar, sec, sel, sem, sib, sic, sid, sið, sin, sit, siþ, soð, sol, soþ, suð, sum, syb, syn, syx, syþ, tan, teð, teþ, tid, til, tin, tir, tor, tun, tyd, tyn, tyr, wac, wæf, wæg, wæl, wæn, wær, wæs, wæt, wag, wah, wan, was, wat, web, wed, weg, wel, wen, wep, wer, wes, wet, wic, wid, wið, wif, wig, win, wir, wis, wit, wiþ, woc, wod, woð, woh, wol, wom, won, wop, wyð, wyl, wyn, wyt, zeb, þæh, þæm, þær, þæs, þæt, þam, þan, þar, þas, þat, þec, þeh, þem, þer, þes, þet, þin, þis, þon, þus, þyð, þyn, þys

But the alternate, V-C-V, has far fewer instances. I count 51:

ace, ada, aða, ade, aðe, ado, æce, æna, æne, æni, æra, æse, æte, æwæ, aga, age, ana, ane, ara, are, awa, awo, eca, ece, eci, eðe, ege, ele, ely, esa, ete, eþe, iða, ige, ipe, oga, ore, oxa, uðe, una, ura, ure, uta, ute, utu, uþe, yða, yðe, yne, yþa, yþe

269 of the CVC forms overlap with the CVCV forms, suggesting either that the CVCV forms that overlap with CVC forms are inflections of the root, or that they are coincidentally similar and represent two different lexemes.
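One way to compute such an overlap, sketched under the assumption that "overlap" means a CVCV form whose first three letters are an attested CVC form. The function name `overlapping_roots` is hypothetical, and the sample words are a small subset of the lists above.

```python
def overlapping_roots(cvc_forms, cvcv_forms):
    """Map each CVC form to the CVCV forms that extend it by a final vowel."""
    roots = set(cvc_forms)
    overlap = {}
    for form in cvcv_forms:
        stem = form[:3]  # drop the final vowel of the CVCV form
        if stem in roots:
            overlap.setdefault(stem, []).append(form)
    return overlap


cvc = ["ban", "dæg", "god", "wer", "pyt"]
cvcv = ["bana", "bena", "dæge", "goda", "gode", "godu", "wera", "were"]

result = overlapping_roots(cvc, cvcv)
for root, extended in result.items():
    print(root, extended)
```

A match here only shows that the spellings line up. Deciding whether a pair is root plus inflection or two unrelated lexemes still takes a dictionary or a human reader.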

As I continue to refine my OE Parser, I wonder whether employing PIE root forms might be useful in identifying lexemes. Certainly, when I turn to programming James E. Cathey’s tremendous diachronic phonology of Germanic languages, root form/shape will play an essential role. One of the methods I wrote in Python checks for root form/shape, and I hoped to use it to identify spelling variants by allowing variation only in the root vowel of a form. So: C-V(1)-C, C-V(2)-C, … C-V(n)-C.
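The idea of allowing variation only in the root vowel can be sketched as follows. This is an illustration rather than the method from the OE Parser: it assumes the same vowel set as before (a, æ, e, i, o, u, y), and `consonant_frame` and `variant_candidates` are hypothetical names.

```python
from collections import defaultdict

# Assumed vowel inventory, not taken from the original parser.
VOWELS = set("aæeiouy")


def consonant_frame(word):
    """Blank out the vowels so that 'man' and 'mon' share the frame 'm_n'."""
    return "".join("_" if ch in VOWELS else ch for ch in word.lower())


def variant_candidates(words):
    """Group words by consonant frame and keep frames with several forms."""
    groups = defaultdict(list)
    for word in words:
        groups[consonant_frame(word)].append(word)
    return {frame: forms for frame, forms in groups.items() if len(forms) > 1}


sample = ["man", "mon", "hyt", "hit", "lac"]
print(variant_candidates(sample))
```

The groups are candidates only. A frame such as g_d would lump together forms that belong to different lexemes, and þ and ð are kept distinct here even though scribes used them interchangeably.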

Back to the crossword!

Bigrams

Posted on March. 9.2019 by Stephen Harris

9 March 2019. Puzzling out a word jumble, I’m writing a Python script to search a grid for words. Step one is to compile a list of legal bigrams in English. A bigram is a pair of letters that sit side by side. The letter <Q>, for example, has a limited list of bigrams in English: we see <QU> as in quit, <QA> as in Qatar (and a few others if you allow very rare words).

I found a huge list online of English words compiled from web pages. 2.5 megs of text file! Here is the resulting Python dict of bigrams:

{'A':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
'B':['A', 'B', 'C', 'D', 'E', 'F', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'G', 'P', 'Z', 'Q'],
'C':['A', 'I', 'K', 'T', 'U', 'E', 'O', 'Y', 'H', 'C', 'L', 'M', 'N', 'Q', 'R', 'S', 'D', 'B', 'W', 'Z', 'G', 'P', 'F'],
'D':['V', 'W', 'E', 'I', 'O', 'L', 'N', 'A', 'U', 'G', 'Y', 'R', 'P', 'C', 'D', 'F', 'H', 'J', 'M', 'S', 'T', 'Z', 'B', 'K', 'Q'],
'E':['H', 'R', 'D', 'N', 'E', 'S', 'M', 'Y', 'V', 'L', 'A', 'C', 'I', 'P', 'T', 'K', 'Z', 'U', 'G', 'W', 'B', 'F', 'O', 'X', 'Q', 'J'],
'F':['F', 'T', 'A', 'U', 'O', 'E', 'I', 'Y', 'L', 'G', 'R', 'S', 'W', 'Z', 'N', 'V', 'H', 'B', 'K', 'D', 'M', 'J', 'P', 'C'],
'G':['I', 'E', 'H', 'L', 'N', 'A', 'Y', 'O', 'R', 'M', 'U', 'S', 'D', 'G', 'K', 'P', 'B', 'W', 'T', 'F', 'C', 'V', 'J', 'Z'],
'H':['R', 'E', 'L', 'M', 'I', 'Y', 'O', 'U', 'A', 'T', 'N', 'S', 'W', 'B', 'P', 'Z', 'G', 'C', 'F', 'D', 'H', 'J', 'K', 'V', 'Q'],
'I':['C', 'T', 'N', 'S', 'O', 'E', 'A', 'Z', 'R', 'L', 'D', 'U', 'P', 'G', 'B', 'V', 'F', 'M', 'I', 'X', 'K', 'Y', 'W', 'H', 'Q', 'J'],
'J':['E', 'O', 'U', 'A', 'I', 'H', 'J', 'R', 'Y', 'P', 'D', 'M', 'W', 'L', 'T', 'N', 'B', 'K'],
'K':['A', 'H', 'E', 'I', 'Z', 'M', 'N', 'B', 'S', 'L', 'O', 'C', 'K', 'P', 'R', 'T', 'U', 'W', 'Y', 'D', 'F', 'G', 'J', 'V'],
'L':['F', 'L', 'U', 'I', 'O', 'E', 'Y', 'A', 'M', 'T', 'S', 'N', 'V', 'C', 'D', 'B', 'G', 'H', 'P', 'R', 'K', 'W', 'J', 'Q', 'Z', 'X'],
'M':['A', 'P', 'E', 'B', 'I', 'O', 'H', 'U', 'Y', 'M', 'S', 'T', 'F', 'L', 'W', 'N', 'R', 'C', 'G', 'V', 'K', 'D', 'J', 'Z', 'Q'],
'N':['I', 'A', 'C', 'E', 'D', 'T', 'U', 'O', 'S', 'R', 'G', 'Y', 'M', 'N', 'Z', 'L', 'P', 'K', 'F', 'H', 'Q', 'B', 'J', 'V', 'X', 'W', '-'],
'O':['L', 'N', 'R', 'S', 'I', 'M', 'T', 'U', 'G', 'O', 'W', 'A', 'B', 'D', 'H', 'V', 'X', 'C', 'K', 'Z', 'P', 'Y', 'E', 'F', 'Q', 'J'],
'P':['E', 'T', 'O', 'Y', 'I', 'H', 'S', 'R', 'A', 'N', 'U', 'L', 'P', 'M', 'J', 'B', 'D', 'F', 'W', 'K', 'C', 'G', 'V', 'Q'],
'Q':['U', 'I', 'A', 'R', 'E', 'O', 'Q'],
'R':['D', 'O', 'U', 'E', 'A', 'I', 'T', 'Y', 'R', 'S', 'V', 'M', 'B', 'P', 'G', 'N', 'H', 'L', 'F', 'W', 'C', 'K', 'J', 'Q', 'X', 'Z'],
'S':['C', 'T', 'A', 'E', 'S', 'I', 'G', 'H', 'K', 'O', 'M', 'U', 'F', 'Q', 'V', 'Y', 'P', 'L', 'N', 'B', 'W', 'D', 'R', 'J', 'Z'],
'T':['E', 'I', 'O', 'H', 'A', 'T', 'U', 'C', 'N', 'S', 'R', 'M', 'L', 'Y', 'B', 'P', 'F', 'W', 'K', 'Z', 'D', 'G', 'J', 'V', 'Q', 'X'],
'U':['A', 'S', 'L', 'R', 'C', 'M', 'N', 'D', 'T', 'E', 'V', 'P', 'Z', 'B', 'I', 'O', 'X', 'G', 'K', 'F', 'Y', 'W', 'J', 'H', 'Q', 'U'],
'V':['A', 'E', 'I', 'O', 'U', 'Y', 'S', 'R', 'C', 'L', 'V', 'N', 'Z', 'D', 'K', 'G'],
'W':['O', 'H', 'A', 'E', 'I', 'L', 'N', 'S', 'T', 'R', 'M', 'U', 'Y', 'B', 'P', 'W', 'D', 'F', 'K', 'C', 'G', 'Z', 'Q', 'V', 'J'],
'X':['I', 'A', 'Y', 'T', 'E', 'O', 'U', 'M', 'P', 'C', 'B', 'F', 'H', 'L', 'S', 'W', 'R', 'D', 'K', 'N', 'G', 'Q', 'Z', 'V'],
'Y':['S', 'M', 'A', 'R', 'C', 'P', 'G', 'I', 'L', 'N', 'D', 'T', 'X', 'O', 'E', 'Z', 'U', 'F', 'W', 'H', 'B', 'Y', 'K', 'V', 'J', 'Q'],
'Z':['E', 'A', 'U', 'Z', 'I', 'O', 'L', 'G', 'Y', 'R', 'H', 'T', 'N', 'B', 'D', 'P', 'K', 'C', 'M', 'V', 'S', 'F', 'W']
}

And here is the code to get the bigrams (my file of words is called web2.txt, and each word is on a separate line). To keep each letter’s list free of duplicates, I append a following letter only if it is not already in the list.

import os

# web2.txt holds one word per line and sits in the current working directory.
path = os.path.join(os.getcwd(), 'web2.txt')

# One entry per letter; each list collects the letters seen directly after it.
bigrams = {letter: [] for letter in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'}

# The with-block closes the file on exit, so no explicit close() is needed.
with open(path, 'r') as allwords:
    words = allwords.read().upper().split('\n')

for letter in bigrams:
    for word in words:
        # str.find() gives the position of the FIRST occurrence only (or -1),
        # so later occurrences of the letter in the same word are not examined.
        position = word.find(letter)

        # Skip words that lack the letter or that end with it.
        if position == -1 or position + 1 >= len(word):
            continue

        nextletter = word[position + 1]
        if nextletter not in bigrams[letter]:
            bigrams[letter].append(nextletter)

    print('\'{0}\':{1}, '.format(letter, bigrams[letter]))
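The script above examines only the first occurrence of each letter in a word. A possible alternative, sketched here as an assumption rather than as the code that produced the dict above, walks every adjacent pair with `zip` and stores the followers in sets. It also shows how such a table might prune candidate strings when searching a grid. The names `collect_bigrams` and `is_plausible` are hypothetical, and skipping non-letters such as hyphens is a choice made for this sketch.

```python
from collections import defaultdict


def collect_bigrams(words):
    """Map each letter to the set of letters that directly follow it."""
    followers = defaultdict(set)
    for word in words:
        word = word.upper()
        # zip(word, word[1:]) yields every adjacent pair, not just the first.
        for first, second in zip(word, word[1:]):
            if first.isalpha() and second.isalpha():
                followers[first].add(second)
    return followers


def is_plausible(candidate, followers):
    """True if every adjacent pair in the candidate is an attested bigram."""
    candidate = candidate.upper()
    return all(
        second in followers.get(first, ())
        for first, second in zip(candidate, candidate[1:])
    )


table = collect_bigrams(["quit", "Qatar", "banana"])
print(sorted(table["Q"]))           # ['A', 'U']
print(is_plausible("tan", table))   # True
print(is_plausible("quat", table))  # False
```

With a full word list, a path through the grid can be abandoned as soon as `is_plausible` fails, which keeps the search from following strings that no English word contains.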

Germanization of English

Posted on December. 21.2012 by Stephen Harris

It seems that one of the background processes in American English is an increase in adnominal adjectives. You don’t make a choice about a college, but a college choice. The prepositional phrase is turned into a pre-position adjective, turning a noun into an adjective, rendering a compound noun worthy of German. I’ve noticed hundreds. Academics don’t have meetings of the faculty, they have faculty meetings; they no longer discuss the curriculum, they have curriculum discussions. In a recent memo, someone was described as a community heritage preserver, which is an astounding way of saying that she preserves the heritage of her community.

So what? Well, a tea cup is a thing to drink from, and a cup of tea is an amount of tea: I’m making a cup of tea is not the same as I’m making a tea cup. Likewise, a meeting of the faculty is a meeting, plain and simple, composed of faculty. A faculty meeting is a kind of meeting. The first form describes a genus (meeting) populated by a species (faculty); the second introduces a new genus. English speakers do not have kinds of meetings (people meetings, carnival meetings, everyone-in-red meetings, cornhusker-fan meetings, and so forth). English speakers have meetings, plain and simple. Although English speakers can figure out what is meant by both forms, the compounded form adds unnecessary and sometimes misleading ambiguity.

Update [7/19/2015]: The metathesis of adnominal genitives (e.g., leaf of laurel) into denominal adjectives (laurel leaves, Lat. folia laurea) appears to have been popular during the late Republic and early Empire in Rome. Cicero shied from the pattern, preferring the genitives.

So, let’s say that someone studies metaphors of feces in the works of Chaucer, as in “fecopoetics” (read up on it for $105.00). Such a study would actually be a study of images of feces. (Chaucer left us no actual feces.) The order of nouns runs general-and-inclusive to specific-and-exclusive: a study, which is of images, and more specifically of images of feces. The last two nouns and their accompanying words get compounded to feces images, or rather, fecal images, which wrongly introduces a new genus of images. English speakers do not have categories of images (images of birds, images of Joe, images of animals recently returned from drinking, and so forth). English speakers have images, plain and simple. The compounding form fecal is then transferred backwards and upwards to make a new genus of study: fecal study. Or, more likely, pluralized to fecal studies.

Thus arises the proliferation of areas of study. We have fecal studies, gender studies, postcolonial studies, Marxist studies, and so forth. The implication is that each of these compounds represents a new field, which is incorrect. All belong to the same academic field: the study of culture, called “cultural studies.” And all employ the same general methods of the field of cultural studies. What differs is the object of study. One decent and clever fellow, Asa Mittman, wants to start the discipline of monster studies since he studies images of monsters. I honestly don’t see a university opening up a hiring line in monster studies. But I do see universities regularly hiring in cultural studies. So why hide the obvious strength of a mutually-supportive, communal enterprise in divisive compounds—or is that “compounds of divisiveness”?

UPDATE (1/29/2019): A terrific example just arrived by email, a description of a presentation. It reads:

Most methods for relation extraction from text rely on pre-trained entity resolution models in order to find the entities mentioned in text. Text-enhanced knowledge graph (KG) completion methods also rely on such an entity resolution model as a pre-processing step. We present a method to simultaneously learn entity resolution as well as relation extraction or KG completion without relying on a pre-trained entity resolution model or mention-level entity resolution data for training.

Wow! We can see a series of genitive + Noun-ion. So, to extract relations gets changed by turning the object (relations) into a plain adjective, and putting it before a noun ending in -ion (extraction). Similarly, a model that has been pre-trained to resolve entities becomes “pre-trained entity resolution model.” Try it at home. Consider a computer that has been programmed to calculate averages: would that be a programmed average-calculating computer? What a mess! But it’s the new norm. If you want to sound scientific, you take all of your descriptive or appositive phrases and cram them into adjectival positions. That would be positionally crammed adjectival phrase scientism.
