About three-quarters there

Screen shot 12/2/2018.

You’re only as good as your data

That is the lesson here. Single brackets [x] indicate an entry in Ondrej Tichý’s Bosworth-Toller, which I edited into a json file. Double brackets [[x]] indicate an entry in the raw data of Ondrej’s BT, if the word wasn’t found in the json file. Empty brackets indicate no returned value. A word like mæg can mean ‘may’ (V) or ‘kin’ (N). The word didn’t make the structured data, and the raw data mischaracterized it in its verbal form, so the parser didn’t pick up the verb.

Rather than spend days improving the data from Bosworth-Toller, or overwhelm the servers in Prague with BeautifulSoup requests, I’m going to scrape word lists from Old English sites, and OCR some glossaries from freely-available books. If I can compile 10 or 20 word lists and zip them to grammatical information, I can get a percentage of likelihood for any given word. Second, I can use the York-Helsinki Parsed Corpus of Aelfric’s prose through CLTK. It won’t catch all of the words, but might be a help.
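Here is a minimal sketch of how several word lists could be turned into a likelihood for a word's part of speech. It is only an illustration: the three tiny glossaries and the function name pos_likelihood are invented for the example, and real lists would come from the scraped and OCR'd sources.

```python
from collections import Counter

# Hypothetical glossaries: each maps a headword to the part of speech it records.
glossaries = [
    {"mæg": "V", "cyning": "N"},
    {"mæg": "N", "cyning": "N"},
    {"mæg": "V"},
]

def pos_likelihood(word, word_lists):
    """Return {tag: share of lists assigning that tag}, or {} if no list has the word."""
    votes = Counter(wl[word] for wl in word_lists if word in wl)
    total = sum(votes.values())
    return {tag: count / total for tag, count in votes.items()}

print(pos_likelihood("mæg", glossaries))
```

With these toy lists, two of three glossaries call mæg a verb, so the verb reading gets two-thirds of the vote. Whether each list should count equally is an open design choice.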

I’ve written a simple script to inflect any noun or adjective and to conjugate any verb. I can work it backwards to find the root form of a word, then send that to BT.
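Working an inflection backwards might look something like the sketch below. It is not the script described above: the ending list is a small illustrative sample, and candidate_stems is a made-up name. Each candidate would then be looked up in the dictionary data.

```python
# Illustrative endings only; a real table would cover every declension and conjugation.
ENDINGS = ["um", "as", "es", "an", "e", "a", "u"]

def candidate_stems(form):
    """Strip each matching ending to propose stems; the bare form is always a candidate."""
    stems = [form]
    for ending in ENDINGS:
        # Require at least two letters to remain so short words are not stripped to nothing.
        if form.endswith(ending) and len(form) > len(ending) + 1:
            stems.append(form[: -len(ending)])
    return stems

print(candidate_stems("cyningas"))
print(candidate_stems("stanum"))
```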

Final step is to run the words and forms through a syntactic parser. If it sees ne, which carries a weight of 5, then it increases the likelihood that the next word is a verb, since negative particles almost always sit next to verbs in OE. (One can check that with a bigram search.) Similar proximity searches to prepositions, pronouns, and so forth help to assess weights (probabilities).
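One simple way to apply such a weight is sketched here. The weight of 5 for ne comes from the paragraph above. Everything else is an assumption made for the example: the equal starting scores, the additive update, and the names reweight and BOOSTS.

```python
def reweight(prev_word, scores, boosts):
    """Add the boost triggered by the previous word, then normalise to probabilities."""
    adjusted = dict(scores)
    for tag, weight in boosts.get(prev_word, {}).items():
        adjusted[tag] = adjusted.get(tag, 0) + weight
    total = sum(adjusted.values())
    return {tag: value / total for tag, value in adjusted.items()}

BOOSTS = {"ne": {"V": 5}}   # ne carries a weight of 5 toward a following verb
priors = {"V": 1, "N": 1}   # assumed equal scores for an ambiguous word like mæg

print(reweight("ne", priors, BOOSTS))
print(reweight("se", priors, BOOSTS))
```

After ne, the verb reading of the next word rises from one-half to six-sevenths. After a word with no entry in BOOSTS, the scores are unchanged.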

Once this next layer is completed, and the weights adjusted, I will have a decent control to check the more experimental parser.

Fulbright Project

View from Dunton Tower at Carleton University looking north along the Rideau River towards the city of Ottawa.

I am very fortunate this year to have received a Fulbright award. The College of Humanities and Fine Arts at UMass made it possible for me to spend the academic year at Carleton University in Ottawa, Ontario, Canada. While here, I’m working on a natural-language parser of Old English, which I will use to create a semantic map of Old English nouns. In short, I want a computer to recognize an Old English noun and then find all words associated with it. Nouns are names for entities in the world. So a semantic map tells us something about how a language permits people to associate qualities with entities.

Following in the footsteps of Artificial Intelligence researchers like Waleed Ammar of the Paul Allen Institute, I will be using untagged corpora—that is, texts that no one has marked up for grammatical information. I would like to interfere with the data as little as possible.

What makes this project different from similar NLP projects is my aim. I want to produce a tool that can be used by literary critics. I am not interested in improving Siri or Alexa or a pop-up advertisement that wants to sell you shoes. Neither is my aim to propose hypotheses about natural languages, which is a general aim of linguistics-related NLPs. So, the object of my inquiry is artful writing, consciously patterned language.


The first stage is to write a standard NLP parser using tagged corpora. The standard parser will serve to check any results of the non-standard parser. Thanks to the generosity of Dr. Ondrej Tichý of Charles University in Prague, the standard parser is now equipped with a list of OE lexemes, parsed for form. A second control mechanism is the York-Helsinki Parsed Corpus of Old English, which is a tagged corpus of most of Aelfric’s Catholic sermons.


At the same time, I divided the OE corpus into genres. In poetic texts, dragons can breathe fire. But in non-fictional texts, dragons don’t exist. So a semantic field drawn around dragons will change depending on genre. I am subdividing the poetry according to codex, and then according to age (as far as is possible) to account for semantic shift. Those subdivisions will have to be revised, then abandoned as the AI engine gets running. (I’ll be using the python module fastai to implement an AI.)


Unicode. You’d think that searching a text string for another text string would be straightforward. But nothing is easy! A big part of preparing the Old English Corpus for manipulation is ensuring that the bits and bytes are in the right order. I had a great deal of difficulty opening the Bosworth-Toller structured data. It was in UTF-16 encoding, similar to the character encoding used on Windows machines. When I tried to open it via python, the interpreter threw an error. It turns out, Unicode is far more complex than one imagines. Although I can find ð and þ, for example, I cannot find them as the first characters of words after a newline (or even regex \b). Another hurdle.

Overcame it! The problem was in the data. For reasons unknown to me, Microsoft Windows encodes runic characters differently than expected. So the solution was to use a text editor (BBEdit), go into the data, and replace all original thorns with regular thorns. Same for eth, æsc, and so forth. Weirdly, it didn’t look like I was doing anything: both thorns looked identical on the screen.
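When two characters look identical on screen, printing their code points and Unicode names can show the difference. The sketch below is illustrative only and does not reproduce the Bosworth-Toller data. The helper name describe is invented, and the zero-width space is a hypothetical example of an invisible character, not a diagnosis of what was actually in the file.

```python
import unicodedata

def describe(text):
    """List (character, code point, Unicode name) for every character in text."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")) for ch in text]

# A UTF-16 byte string, as python would see it when reading a file in binary mode.
raw = "þæt".encode("utf-16")
text = raw.decode("utf-16")      # same decoding as open(path, encoding="utf-16")

for row in describe(text):
    print(row)

# Hypothetical look-alike: a thorn preceded by an invisible zero-width space.
suspect = "\u200bþ"
print(describe(suspect))
print(suspect.startswith("þ"))   # False, although it looks like a bare thorn on screen
```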


Screen shot of parser guts so far. Markup is data from Tichý’s Bosworth-Toller. Inflect receives Markup, then returns inflections based on the gender of strong nouns. Variants and special cases have not yet been included.

To finish STAGE ONE, I’ll now inflect every noun, pronoun, and adjective, then conjugate every verb as they come in (on the fly). Ondrej Tichý at Charles University in Prague, who very generously sent me his structured data, took a slightly different approach: he generated all the permutations first and placed them into a list. Finally, as a sentence comes in, I’ll send off each word to the parser, receive its markup and inflections/conjugations, then search the markup for matches.

Square Roots

My daughter and I were recently playing with python’s square root function. She discovered that if you take the square root of a number made of an even number of ones, the result has half that number of threes on both sides of the decimal point. So √11 is approximately 3.3, √1111 is approximately 33.33, and so forth. We learned that in python’s output this continues until there are eight threes on either side of the decimal point; after that the threes to the right of the point thin out, because a floating-point number holds only about 16 significant digits.

The square root of an odd number of ones is also patterned. √1 is 1, √111 is 10.5, √11111 is 105.4, √1111111 is 1054.0, and so forth.

So we decided to write a python program to generate 20 instances. Here is the program:

#!/usr/bin/env python3
"""Determines the square roots of numbers comprised of ones like 11, 111, 1111, etc."""
import math

bobby = 10  # the next power of ten to add
sue = 1     # the running number made of ones
for x in range(1, 20):
    answer = bobby + sue   # put another 1 on the front
    sue = answer
    bobby = bobby * 10
    print(x + 1, '\tThe square root of ', answer, ' is ', math.sqrt(answer))

And here are the answers:

2 The square root of  11  is  3.3166247903554
3 The square root of  111  is  10.535653752852738
4 The square root of  1111  is  33.331666624997915
5 The square root of  11111  is  105.40872829135166
6 The square root of  111111  is  333.333166666625
7 The square root of  1111111  is  1054.0925006848308
8 The square root of  11111111  is  3333.3333166666666
9 The square root of  111111111  is  10540.925528624135
10 The square root of  1111111111  is  33333.33333166667
11 The square root of  11111111111  is  105409.25533841894
12 The square root of  111111111111  is  333333.33333316667
13 The square root of  1111111111111  is  1054092.553389407
14 The square root of  11111111111111  is  3333333.3333333167
15 The square root of  111111111111111  is  10540925.533894593
16 The square root of  1111111111111111  is  33333333.333333332
17 The square root of  11111111111111111  is  105409255.33894598
18 The square root of  111111111111111111  is  333333333.3333333
19 The square root of  1111111111111111111  is  1054092553.3894598
20 The square root of  11111111111111111111  is  3333333333.3333335

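Although it looks like the sixes also multiply, they too seem to stop after reaching eight in a row. That cutoff is a limit of floating-point precision rather than of the pattern: math.sqrt returns a float, so wrapping it as Decimal(math.sqrt(answer)) shows only the exact value of that float, not more correct digits. To check the pattern properly, use python’s decimal package: from decimal import Decimal, getcontext, set getcontext().prec to the number of digits you want, then in the print statement use Decimal(answer).sqrt().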
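The pattern can be checked past float precision with the decimal module's own square root. For a number made of 2n ones, the root is roughly 10^n/3 − 1/(6 × 10^n), which is where the run of threes followed by 1666... comes from. For an odd count it is a power of ten times √10/3 ≈ 1.0540925..., less a small correction. The function name root_of_ones and the choice of 50 digits are just for this sketch.

```python
from decimal import Decimal, getcontext

getcontext().prec = 50   # work with 50 significant digits

def root_of_ones(count):
    """Square root of the integer written as `count` ones, to the current precision."""
    return Decimal("1" * count).sqrt()

for count in (2, 4, 20, 21):
    print(count, root_of_ones(count))
```

At 50 digits, the root of twenty ones shows ten threes on each side of the decimal point before the 1666... run begins, so the threes and sixes keep multiplying.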

Free Will

Wednesday 6 December: a very exciting discussion about free will put on by the Erasmus Center. Sincere thanks to Jim Holden and to Erasmus for inviting me to respond to Peter Tse, author of The Neural Basis of Free Will (MIT Press, 2013).

My main point during the debate was that standards of proof and acceptable methods of testing are not yet available to neuro-scientists to establish a physiological basis of free will. Study of the neuron is the province of bio-chemistry, which has its own standards of proof and acceptable methods of testing. These standards have been developed over decades, through argument and counter-argument, and through experimentation. They are not optional—not if you seek accurate results. Freedom is a concept discussed for centuries by philosophers, theologians, political scientists, and historians. Each of those fields has its own standards of proof and acceptable methods of argumentation. Those standards are important to ensuring logical results. Will or volition is chiefly the province of psychology, with its own standards of proof and acceptable methods of testing. So bringing bio-chemical evidence to a philosophical debate about a psychological topic seems to me to be like, as Laurie Anderson said, trying to dance architecture.

A secondary point I made was that any logical investigation proceeds from the question that you set. So, setting the question correctly is essential. We would not have had a debate had Dr. Tse written a book entitled, The Neural Basis of Unconstrained Choice. The phrase “free will” connotes something in English that the phrases “unconstrained choice” or “unfettered desire” do not. So, I tried to show how desire is different from will in English, how French and Latin are different again, and how investigating free will in English entails different logical assumptions than investigating it in French or Latin. In English, will connotes desire, want, action. In French, arbitre connotes sight, judgment, observation. Different semantic fields with little overlap. Another example: the greatest virtue according to Christians is love. That’s English. In the Latin Bible, the word is caritas. You can also translate caritas as charity (faith, hope, and charity). You can give charity without being in love, such as for tax purposes. So which one is the virtue? Faith in the Latin Bible is fides, which can also be translated as loyalty. Which is it? There’s a big difference between obeying someone that you don’t believe in and believing someone whom you don’t obey. Same for freedom. The French prize Liberté, or liberty. Would Dr. Tse have found the same things if he looked for liberty of desire? I don’t think so.

I also made the case that, as Gertrude Stein said of Oakland, “there’s no there there.” “Free will” is a concept that English speakers use to talk about a whole host of connected ideas and psychological processes. Free will is not a thing. It doesn’t exist the way Plymouth Rock or the Boston Marathon exist. Where do you find free will? I say, in a dictionary.

The public discussion among the guests afterwards was terrific. No one in the room doubted that the brain is essential to thinking. But there seemed a general consensus that thought is not reducible to bio-chemistry. Some people made the point that our morality and personal values depend upon a non-reductive view, on a non-physicalist view, of will. Others said that there are psychological responses that we think are free, but are actually conditioned or instinctive. So we have to distinguish the choices that are free from those that are not. Others asked whether or not free will introduces randomness into science, and if so, to what degree. (I tend to think that decisions are not made randomly, but on the basis of stochastic algorithms that measure optimality by accounting for values, external conditions, imagined results, and so forth.) What was most apparent to me is that neuro-science is not going to trump dozens of disciplines, centuries of carefully thought-out positions, and carefully considered, methodical experimentation. It reaffirmed my faith in the multiplicity of a university, of a fundamental need for diversity of viewpoints, all speaking with each other, with each one grounded in a distinct intellectual tradition.

Bede and Verlaine

I just stumbled across Verlaine’s Bonheur, which he described as part of a Catholic triptych. He wrote it in the late 1880s and early 1890s, finishing the manuscript in January 1891. Poem 28 is in the form of an epanaleptic elegy, like Bede’s Hymn to Aethelthryth, modified into an ABABA rhyme. The first line matches the last. Here are the first two stanzas:

Les plus belles voix
De la Confrérie
Célèbrant le mois
Heureux de Marie.
O les douces voix!

Monsieur le Curé
L’a dit à la Messe:
C’est le mois sacré.
Écoutons sans cesse
Monsieur le Curé.

I’m struck by how tenacious is the poetic form. Not only does the form allude to Latin elegies of the Church, but it also requires the poet to repeat a phrase in different contexts, which is a practice in prayer. Repetition in slightly different contexts allows the poet to restore a reader’s wandering imagination to a fixed narrative while serially enlarging the semantic force of the narrative. So we meet fairly cliché images: voices of a monastery, a curate saying Mass. But slight variation and addition of phrases moves our imagination from topic to topic, connotation to connotation, and thereby gives fuller character to otherwise dry words. Verlaine focuses our attention first on superlative beauty, a characteristic we then apply to voices. We follow those voices to a confraternity. The verb célèbrant is in the plural, which places an image of celebration in our minds, but does not assign it to the singular confrérie. Instead, it is the beautiful voices that celebrate.

Next we greet the singular mois ‘month’, which until the /s/ at the end of the line leaves us in suspense, wondering if it will end as moine ‘monk’. The masculine mois is set in parallel with the previous line’s feminine Confrérie, indicating in part the capaciousness of the community, like a month that contains many days. A natural caesura at the end of the third line interrupts the semantic flow of the Noun + Adjective (mois heureux). This pause sets the initial words of lines 3 and 4 in apposition: celebrate and joyful. And now, a bit of genius! Verlaine wraps a singular masculine phrase (le mois heureux) with two singular feminine nouns, confrérie and Marie. Here, one is reminded of Bede’s description of Caedmon’s monastery at Whitby where a community of men is ruled by a woman, Hild, and the Wisdom of God (in Latin, Wisdom is feminine).

After the celebratory kernel, we finish the wrapping with a transformed choir of voices. At first the most beautiful, now, transformed through their celebration of Mary, they are gentle and sweet. Similar transformation will take place with the mois ‘month’ which begins joyfully, then transforms in the second stanza to sacred, sacré.

This is a poem of nine stanzas. The number is not insignificant, as it is the number of months Christ was in the womb, the number of months it takes for the physical body to gain its spiritual soul, and perhaps the number of stanzas it takes for a reader suffused with the physical beauty of Verlaine’s verse to achieve a spiritual understanding.

Anglo-Saxon Lyre 4 (New Lyre)

The Serpentine Lyre.

The second lyre is named for the interlace pattern on its headstock. The last lyre had two dragon heads, so it’s the Dragon-head lyre. Simple. Number Two is made completely of mahogany. The Sutton Hoo lyre was oak and had no sound holes. So, it seems as if the original makers relied on the sonority of the box itself to carry the sound waves. That means paying close attention to grain direction and keeping the joints tight. I decided on a frame, much like a torsion box, but with the headstock exposed.

The frame is 3/4″ x 3/4″ mahogany. There are two main points of potential warp: torque on the cross-beam and tilt on the base. The headstock is so heavy and the string pins so thick that tilt is unlikely there. So, joinery type will be chosen to account for that.


The main crossbrace is joined with a through-tenon. If it is seated in the mortise well, then the brace won’t tilt (or roll) forward or backward, causing the lyre’s skin to buckle.

You can see that it’s pretty well seated. There’s a little daylight, but I filled that with epoxy. All the main structural joints are epoxied. The base of the lyre is going to anchor the force of the strings, like the anchorage of the Brooklyn Bridge. So the main counterforce, it seems, has to be against tilt (or roll). So I decided on a bridle joint. I also wanted a lot of exposed long grain for glue.

I kept the proportions of the joint to thirds, which left enough wood to sustain the pressure. Even with as hard a wood as mahogany, I wouldn’t make the frame any thinner. Next, I wanted to reinforce the base with a secondary brace that would also support the bridge. To counteract both tilt and torque, I decided on a half-lap dovetail to secure the brace to the sides, and mortise and tenon to secure the stiles.

Here you can see the seating for the half-lap dovetail:

It’s a lot of chisel work! But mahogany is gorgeous to work with. Here is the entire base architecture just before glue-up:

I used regular wood glue for the crossbrace assembly, since the joints are doing most of the work.

But just to be sure, I added dowels:

In the picture above, you can also see the mahogany skin. I skinned one side with 1/16″ mahogany veneer. It was too thin to support anything structurally, so I put in a series of very thin supports. These were walnut. I wanted something with closer grain than mahogany, both to avoid splintering and to carry high-frequency sound more efficiently.

Here is the frame with one side of skin:

After all that work and all those beautiful joints, one is tempted to show it off–very Greene & Greene! But a clean aesthetic demands otherwise, so I hid the joinery. I don’t have pictures of carving the headstock, but you can see it in the following video. (I made the video so you can hear the sound.) The saddle is a strip of walnut that spans the mahogany skin and supports a mahogany bridge. The tailpiece is mahogany and clutches the body like a c-clamp. The strings are sent through holes and their ends are tied, like a classical guitar. The tuning pegs are zither pegs, tuned with a handle.

The tuning is CDGCDG. I came up with the rhythm and melody so that the music would match the syllabification of Caedmon’s Hymn, caesurae included. Here are the first four lines (click link to watch video):

SerpentineLyre Small

Old English Parser 6

Trinity College Library

Thanks to the generosity of the College of Humanities and Fine Arts at UMass, the parser is now working at a very basic level. Here it is: http://www.bede.net/misc/dublin/parse.html

Mark Faulkner of Trinity College Dublin, aided by the generosity of the Irish Research Council, brought me to Trinity in July to describe the parser. He hosted a terrific conference where a wide variety of scholars and scientists presented big-data projects. One conclusion from my perspective was that medieval texts (as well as all ancient languages) are as subject to the Law of Large Numbers as are network packets, social media posts, and traffic patterns. One surprise was how few samples were needed.

Specifically, texts (that is, generically similar sets of syntactically correct utterances) will approach audience expectations as defined by genre, language, and period. These expectations are in turn characterized by sets of related vocabulary items. For example, a sermon employs certain words to indicate to an audience that it is indeed a sermon; or conversely, it is by virtue of certain vocabulary items that we classify a text as a sermon. (See Michael Drout, Lexomics.) Moreover, those vocabulary items tend to come in serially-arranged chunks so that, as an extreme example, an adventure story may use words to describe setting and character before using words that describe dying.

Consequently, a parser that seeks to classify for genre or for aesthetic particulars considers not only vocabulary, but related vocabulary items, and the order in which those items appear over the span of the text. That parser will also consider phrases and phrase structure. Old English poetry, once recognized through its combination of rhetorical tropes (alliteration, hypotaxis, etc.), can be parsed metrically. More importantly, word combinations and clusters can be searched for and a stochastic algorithm applied in order to yield high-frequency clusters. That algorithm is the challenge. Although Google has developed similar algorithms to rank websites and to effect advertiser auctions, deciding on the likelihood of a grammatical claim is a different problem. My challenge over the next year or so is to break down the assumptions a fluent reader of Old English makes when reading a text. The biggest roadblock will be the desire to put in my own grammatical knowledge. A Natural Language Parser does not rely on a grammarian’s parse of a text. It relies instead on all texts ever written in that language.
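As a first step toward finding high-frequency clusters, one could simply count which words turn up near each other. The sketch below does only that counting; it is not the stochastic algorithm the paragraph calls for. The function name cluster_counts, the window size, and the toy token list are all assumptions for illustration.

```python
from collections import Counter

def cluster_counts(tokens, window=3):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look at the next (window - 1) tokens after the current one.
        for other in tokens[i + 1 : i + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts

# Toy token list; a real run would read an untagged corpus.
tokens = "ne mæg ic ne mæg he".split()
print(cluster_counts(tokens).most_common(3))
```

Raw counts like these favor common words, so a real ranking would still need some statistical measure of association on top.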

The most exciting possibility to my mind is using glosses to tie OE literature into its Latin and Celtic analogues. A gloss is a single-word translation of one language into another. For example, French eau can be glossed by English water. Old English scribes glossed many Latin words. They wrote them in tiny letters above the Latin. Using these glosses, we could connect OE texts to Latin texts more closely and trace the migration and adaptation of images and collocations over time. Paired with information about the movement of manuscripts, we could map the dispersal of ideas, images, and metaphors over time and space.

Because some of these metaphors constitute defining characteristics of a genre (such as lyric), we can watch the evolution of genre over time. And by examining the structure and constitution of these texts in multiple languages, we could observe the interrelation and mutual influence of written culture.

The bottom line? In the next stage, I have to treat a text like an organism. Ask what other organisms are like it. Then try to dissect their DNA and determine which genetic markers came from where.

Perl Calendar > HTML

This perl script generates the HTML for a simple 12-month calendar. Run it in Terminal: it will ask which day of the week is the first of January. Then, copy-and-paste the results to your HTML code. Modify the script as you like.

#!/usr/bin/perl -w

# generate a calendar for HTML
# sharris@umass[dot]edu 2017

# ___________________________________________ declarations

my $start = 0;
my @monthNames = qw(January February March April May June July August September October November December);
my @month = qw(31 28 31 30 31 30 31 31 30 31 30 31); # days per month (non-leap year)
my @dayNames = qw(Su M Tu W Th F Sa);

my $table = "<table width='200' border='1' cellpadding='2' cellspacing='1' bordercolor='#000000'>";
my $tr = "<tr align='center' valign='middle'>";

my $weekday = 0; # which day of the week is it?
my $date = 1; # what date is it?
my $m = 0; # index of the current month
my $offset = 0; # offset counter
my $i = 7; # 7 days in a week

# ____________________________________________ setup
print "\nOn which day of the week is the first of January? (M T W H F S U) ";
$start = <STDIN>;
chomp($start); # strip the trailing newline
$start = uc($start); # accept lower-case answers too

# now we set the offset: the number of blank cells before the first of the month
# (strings are compared with eq; == would compare them as numbers)

if ($start eq "M"){$offset = 1;}
if ($start eq "T"){$offset = 2;}
if ($start eq "W"){$offset = 3;}
if ($start eq "H"){$offset = 4;}
if ($start eq "F"){$offset = 5;}
if ($start eq "S"){$offset = 6;}
if ($start eq "U"){$offset = 0;}

# ___________________________________________ BIG LOOP

foreach (@monthNames) {
print "<p><b>".$_."</b></p>\n";
print $table;
print $tr;

foreach (@dayNames){
print "\n<td bgcolor='#CCCCCC'><b>".$_."</b></td>";
} # header loop
print "</tr>\n";

# equals 1 - $offset; dates below 1 print as blank cells
$date = $month[$m] - ($month[$m] + $offset) + 1;

until ($date > $month[$m]){
for ($i = 7; $i > 0; $i--){
if ($i == 7){print "<tr align='center' valign='middle'>";}
print "<td bgcolor='#CCCCCC'>";
if ($date > 0 && $date <= $month[$m]){print $date;} else {print "&nbsp;";}
print "</td>\n";
if ($i == 1){print "</tr>";}

# on the last day, remember which column the next month starts in
if ($date == $month[$m]){$offset = (8 - $i) % 7;}
$date++;
} # weekly loop

} # fill loop
print "</table>";
$m++; # move on to the next month
} # monthly loop

Old English Parser 5



Update 5/2017. Now that the static state parser is working, time to move on to the syntax. The next big issue is language. So far, I’ve written in PERL. Love it. Works brilliantly. But much of the available open-source code is in python. NLTK is in python. You can tokenize with it. Parse with it. The Helsinki parsed OE corpus is in it. And it loads into a script easy-peasy. Very tempting to switch horses in mid stream.

Second issue: I’m moving the parser from UNIX to LINUX.

I’m putting the parser on a sandboxed Dell that’s about 10 years old and about $150 on Amazon. Fine computer that still runs Ubuntu! As an aside, I can configure the IP tables to reject traffic from China, which cuts back significantly on lost processor time spent dealing with robot hack attempts every 2 seconds. Here hacker “tomcat” is calling from an address registered to Chinanet, No.31, Jingrong Street, Beijing.

Two seconds later, the next caller, at port 36116:11, is from the same server. Between this sort of nonsense and the spam, it’s a wonder academic computing can do anything else.

Now that the basic parser is working, it needs fine-tuning. First, the lists of closed-class words need to be improved. Pronouns are especially difficult to differentiate, and I wonder if I need to. Relative, interrogative, demonstrative—are these required to assemble phrases?

Second, the list of PRP is too short. Mitchell & Robinson have a good list in their textbook.

Third, the order of assembly for the syntax portion needs some thought. Obviously, closed class words must come first. After pronouns, prepositional phrases might be easiest to isolate. The trick is to try to think like an undergrad: what were the first steps? Small closed-class words (PRN, CNJ, PRP), then obvious inflections (-um, -ode, -ost, etc).
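That order of assembly could be sketched as a first pass like the one below: closed-class lookup first, then obvious inflections. The word list, the labels attached to each suffix, and the function name first_pass are invented for illustration. They are far smaller and cruder than what the parser would need.

```python
# Tiny illustrative lists; the tag names follow the abbreviations used above.
CLOSED_CLASS = {"ic": "PRN", "he": "PRN", "and": "CNJ", "on": "PRP", "to": "PRP"}
SUFFIX_HINTS = [("ode", "V past"), ("ost", "ADJ superlative"), ("um", "dative plural")]

def first_pass(word):
    """Tag closed-class words first, then fall back to an obvious inflection."""
    word = word.lower()
    if word in CLOSED_CLASS:
        return CLOSED_CLASS[word]
    for suffix, label in SUFFIX_HINTS:
        if word.endswith(suffix):
            return label
    return "?"   # left for later stages

print([(w, first_pass(w)) for w in "he lufode on dagum".split()])
```

Checking the closed-class list before the suffixes means a short word is not mistaken for an inflected form just because of how it ends.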

Update 6/20/2017. It’s working. Parser finds possible inflections and declensions based on suffixes, then checks for closed class, then checks for prepositional phrases. Still a long way to go. Noun phrases, the right boundary of PP, suboptimal parses, and so on. But it’s working. Here it is: http://www.bede.net/misc/dublin/parse.html