Bigrams

9 March 2019. Puzzling out a word jumble, I’m writing a python script to search a grid for words. Step one is to compile a list of legal bigrams in English. A bigram is a pair of letters that sit side by side. The letter <Q>, for instance, admits only a short list of bigrams in English: we see <QU> as in quit, <QA> as in Qatar (and a few others if you allow very rare words).

I found a huge list online of English words compiled from web pages. 2.5 megs of text file! Here is the resulting python dict of bigrams:

{'A':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
'B':['A', 'B', 'C', 'D', 'E', 'F', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'G', 'P', 'Z', 'Q'],
'C':['A', 'I', 'K', 'T', 'U', 'E', 'O', 'Y', 'H', 'C', 'L', 'M', 'N', 'Q', 'R', 'S', 'D', 'B', 'W', 'Z', 'G', 'P', 'F'],
'D':['V', 'W', 'E', 'I', 'O', 'L', 'N', 'A', 'U', 'G', 'Y', 'R', 'P', 'C', 'D', 'F', 'H', 'J', 'M', 'S', 'T', 'Z', 'B', 'K', 'Q'],
'E':['H', 'R', 'D', 'N', 'E', 'S', 'M', 'Y', 'V', 'L', 'A', 'C', 'I', 'P', 'T', 'K', 'Z', 'U', 'G', 'W', 'B', 'F', 'O', 'X', 'Q', 'J'],
'F':['F', 'T', 'A', 'U', 'O', 'E', 'I', 'Y', 'L', 'G', 'R', 'S', 'W', 'Z', 'N', 'V', 'H', 'B', 'K', 'D', 'M', 'J', 'P', 'C'],
'G':['I', 'E', 'H', 'L', 'N', 'A', 'Y', 'O', 'R', 'M', 'U', 'S', 'D', 'G', 'K', 'P', 'B', 'W', 'T', 'F', 'C', 'V', 'J', 'Z'],
'H':['R', 'E', 'L', 'M', 'I', 'Y', 'O', 'U', 'A', 'T', 'N', 'S', 'W', 'B', 'P', 'Z', 'G', 'C', 'F', 'D', 'H', 'J', 'K', 'V', 'Q'],
'I':['C', 'T', 'N', 'S', 'O', 'E', 'A', 'Z', 'R', 'L', 'D', 'U', 'P', 'G', 'B', 'V', 'F', 'M', 'I', 'X', 'K', 'Y', 'W', 'H', 'Q', 'J'],
'J':['E', 'O', 'U', 'A', 'I', 'H', 'J', 'R', 'Y', 'P', 'D', 'M', 'W', 'L', 'T', 'N', 'B', 'K'],
'K':['A', 'H', 'E', 'I', 'Z', 'M', 'N', 'B', 'S', 'L', 'O', 'C', 'K', 'P', 'R', 'T', 'U', 'W', 'Y', 'D', 'F', 'G', 'J', 'V'],
'L':['F', 'L', 'U', 'I', 'O', 'E', 'Y', 'A', 'M', 'T', 'S', 'N', 'V', 'C', 'D', 'B', 'G', 'H', 'P', 'R', 'K', 'W', 'J', 'Q', 'Z', 'X'],
'M':['A', 'P', 'E', 'B', 'I', 'O', 'H', 'U', 'Y', 'M', 'S', 'T', 'F', 'L', 'W', 'N', 'R', 'C', 'G', 'V', 'K', 'D', 'J', 'Z', 'Q'],
'N':['I', 'A', 'C', 'E', 'D', 'T', 'U', 'O', 'S', 'R', 'G', 'Y', 'M', 'N', 'Z', 'L', 'P', 'K', 'F', 'H', 'Q', 'B', 'J', 'V', 'X', 'W', '-'],
'O':['L', 'N', 'R', 'S', 'I', 'M', 'T', 'U', 'G', 'O', 'W', 'A', 'B', 'D', 'H', 'V', 'X', 'C', 'K', 'Z', 'P', 'Y', 'E', 'F', 'Q', 'J'],
'P':['E', 'T', 'O', 'Y', 'I', 'H', 'S', 'R', 'A', 'N', 'U', 'L', 'P', 'M', 'J', 'B', 'D', 'F', 'W', 'K', 'C', 'G', 'V', 'Q'],
'Q':['U', 'I', 'A', 'R', 'E', 'O', 'Q'],
'R':['D', 'O', 'U', 'E', 'A', 'I', 'T', 'Y', 'R', 'S', 'V', 'M', 'B', 'P', 'G', 'N', 'H', 'L', 'F', 'W', 'C', 'K', 'J', 'Q', 'X', 'Z'],
'S':['C', 'T', 'A', 'E', 'S', 'I', 'G', 'H', 'K', 'O', 'M', 'U', 'F', 'Q', 'V', 'Y', 'P', 'L', 'N', 'B', 'W', 'D', 'R', 'J', 'Z'],
'T':['E', 'I', 'O', 'H', 'A', 'T', 'U', 'C', 'N', 'S', 'R', 'M', 'L', 'Y', 'B', 'P', 'F', 'W', 'K', 'Z', 'D', 'G', 'J', 'V', 'Q', 'X'],
'U':['A', 'S', 'L', 'R', 'C', 'M', 'N', 'D', 'T', 'E', 'V', 'P', 'Z', 'B', 'I', 'O', 'X', 'G', 'K', 'F', 'Y', 'W', 'J', 'H', 'Q', 'U'],
'V':['A', 'E', 'I', 'O', 'U', 'Y', 'S', 'R', 'C', 'L', 'V', 'N', 'Z', 'D', 'K', 'G'],
'W':['O', 'H', 'A', 'E', 'I', 'L', 'N', 'S', 'T', 'R', 'M', 'U', 'Y', 'B', 'P', 'W', 'D', 'F', 'K', 'C', 'G', 'Z', 'Q', 'V', 'J'],
'X':['I', 'A', 'Y', 'T', 'E', 'O', 'U', 'M', 'P', 'C', 'B', 'F', 'H', 'L', 'S', 'W', 'R', 'D', 'K', 'N', 'G', 'Q', 'Z', 'V'],
'Y':['S', 'M', 'A', 'R', 'C', 'P', 'G', 'I', 'L', 'N', 'D', 'T', 'X', 'O', 'E', 'Z', 'U', 'F', 'W', 'H', 'B', 'Y', 'K', 'V', 'J', 'Q'],
'Z':['E', 'A', 'U', 'Z', 'I', 'O', 'L', 'G', 'Y', 'R', 'H', 'T', 'N', 'B', 'D', 'P', 'K', 'C', 'M', 'V', 'S', 'F', 'W']
}
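For the jumble search itself, a table like this works as a quick legality filter: a candidate path through the grid can be abandoned as soon as one of its adjacent letter pairs is missing from the table. A minimal sketch, using a trimmed stand-in for the full dict above:

```python
# Stand-in for the full table above; trimmed to what the example needs.
BIGRAMS = {'Q': ['U', 'A', 'I'], 'U': ['I', 'T'], 'I': ['T']}

def plausible(word, table):
    """True if every adjacent pair of letters in word is a known bigram."""
    word = word.upper()
    return all(b in table.get(a, []) for a, b in zip(word, word[1:]))
```

A word like “quit” passes (QU, UI, IT are all listed), while a path that produces “qtiu” fails at its first pair and can be pruned immediately.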

And here is the code to get the bigrams (my file of words is called web2.txt, and each word is on a separate line). To keep each letter’s list of followers unique, I use set().

import os

path = os.path.join(os.getcwd(), 'web2.txt')

# One set per letter: a set keeps each letter's followers unique.
bigrams = {letter: set() for letter in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'}

# The with-block closes the file automatically.
with open(path, 'r') as allwords:
    words = allwords.read().split('\n')

for word in words:
    word = word.upper()
    # Walk every adjacent pair of letters, not just the first
    # occurrence of each letter in the word.
    for first, second in zip(word, word[1:]):
        if first in bigrams:
            bigrams[first].add(second)

for letter in bigrams:
    print('\'{0}\':{1}, '.format(letter, sorted(bigrams[letter])))

Bigrams

March 6. An interim step in making a semantic map of Old English is producing bigrams. Bigrams are pairs of words. In order to build a social network of words, you need to know which words connect to one another. For example, in Beowulf, the word wolcnum ‘clouds’ almost always sits next to under ‘under’.

By comparison, the epic poem Judith has no clouds in it. And the homilist Ælfric never uses the phrase under wolcnum.

Here is a screen shot of words that follow ic ‘I’ in the poem Beowulf. So, the first is “ic nah.”

You can see that there are 181 instances of ic, although only 80 are unique. In other words, some bigrams are repeated. The second word of the bigram is printed again in red, and passed to a part-of-speech tagger. The blue text is the tagger’s best guess, and it also returns the part-of-speech most cited by dictionaries. As I plan to discuss in an article, ic is very rarely followed by a verb.
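The counting behind those numbers (181 instances, 80 unique) is a straightforward pass over adjacent token pairs. A sketch with collections.Counter, using a toy token list rather than the Beowulf text:

```python
from collections import Counter

def follower_counts(tokens, target):
    """Count which words immediately follow target in a token stream."""
    return Counter(b for a, b in zip(tokens, tokens[1:]) if a == target)

# Toy data for illustration only, not the actual poem.
tokens = ['ic', 'nah', 'hwa', 'ic', 'eom', 'ic', 'nah']
counts = follower_counts(tokens, 'ic')
```

Here sum(counts.values()) gives the total instances of the target bigram and len(counts) the number of unique followers, the two figures reported above.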

We can discover a great deal about poetic style by looking very closely at the grammar of Old English poetry. The grammar is the unfolding in time of images and ideas and asides and so forth. Grammar describes how the words affect you in order as you read.

Poetic Words

sort | uniq

Has anyone done this since Angus Cameron suggested it in 1973? I separated the Corpus of Old English into genres and sub-genres. Doing so enabled me to find words unique to poetry. The poetic texts are largely from the ASPR, but include Chronicle poems, the Meters of Boethius, and others.

First, I sorted the words into alphabetical order and removed duplicates. Second, I did the same for all prose texts. I also removed all foreign words from the prose texts—those are words that the Dictionary of Old English designated as foreign by placing them within <foreign> tags. Third, I compared prose words with poetic words. The resulting list is a set of all words used only in the poetic texts. Here is the file (right-click to download): PoeticWords
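The pipeline in the three steps above boils down to a set difference. A minimal Python sketch (file names hypothetical; assumes one word per line and that foreign words have already been stripped from the prose):

```python
def poetic_only(poetry_words, prose_words):
    """Words attested in the poetry but never in the prose (case-folded)."""
    poetry = {w.lower() for w in poetry_words}
    prose = {w.lower() for w in prose_words}
    # Set difference does the sort|uniq|compare in one step.
    return sorted(poetry - prose)

# e.g., reading from files:
# with open('poetry.txt') as f:
#     poetry_words = f.read().split()
```

Building sets both deduplicates (uniq) and makes the comparison fast; sorting the result reproduces the alphabetical list.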

The next step is to classify each word by word class. That will allow me to differentiate verbal phrases from noun phrases in the poetry. Once noun phrases are isolated, I can begin to build a semantic map of poetic discourse in Old English. Afterwards, I’ll add verb phrases. So we’ll be able to know how OE poets described queens (adjectives) and what sort of acts queens performed (verbs), and compare that to descriptions of kings and the acts they performed. We can then further differentiate dryhten from cyning, and cwen from ides. But there’s a big caveat.

Because Old English poets wrote alliterative verse, adjectives and verbs may have been chosen simply on account of their initial sound. So, cwen may have attracted /k/-initial words. That is why it is essential to also build a map in prose of cwen. Since the formal structure of prose was not governed by alliteration (with the possible exception of Ælfric), the map in prose and the map in poetry of any given noun might well be distinct.

Fulbright Project

View from Dunton Tower at Carleton University looking north along the Rideau River towards the city of Ottawa.

I am very fortunate this year to have received a Fulbright award. The College of Humanities and Fine Arts at UMass made it possible for me to spend the academic year at Carleton University in Ottawa, Ontario, Canada. While here, I’m working on a natural-language parser of Old English, which I will use to create a semantic map of Old English nouns. In short, I want a computer to recognize an Old English noun and then find all words associated with it. Nouns are names for entities in the world. So a semantic map tells us something about how a language permits people to associate qualities with entities.

Following in the footsteps of Artificial Intelligence researchers like Waleed Ammar of the Allen Institute for AI, I will be using untagged corpora—that is, texts that no one has marked up for grammatical information. I would like to interfere with the data as little as possible.

What makes this project different from similar NLP projects is my aim. I want to produce a tool that can be used by literary critics. I am not interested in improving Siri or Alexa or a pop-up advertisement that wants to sell you shoes. Neither is my aim to propose hypotheses about natural languages, which is a general aim of NLP work in linguistics. So the object of my inquiry is artful writing, consciously patterned language.

STAGE ONE

The first stage is to write a standard NLP parser using tagged corpora. The standard parser will serve to check any results of the non-standard parser. Thanks to the generosity of Dr. Ondrej Tichý of Charles University in Prague, the standard parser is now equipped with a list of OE lexemes, parsed for form. A second control mechanism is the York-Helsinki Parsed Corpus of Old English, which is a tagged corpus of most of Ælfric’s Catholic Homilies.

STAGE TWO

At the same time, I divided the OE corpus into genres. In poetic texts, dragons can breathe fire. But in non-fictional texts, dragons don’t exist. So a semantic field drawn around dragons will change depending on genre. I am subdividing the poetry according to codex, and then according to age (as far as is possible) to account for semantic shift. Those subdivisions will have to be revised, then abandoned as the AI engine gets running. (I’ll be using the python module fastai to implement an AI.)

Notes

Unicode. You’d think that searching a text string for another text string would be straightforward. But nothing is easy! A big part of preparing the Old English Corpus for manipulation is ensuring that the bits and bytes are in the right order. I had a great deal of difficulty opening the Bosworth-Toller structured data. It was in UTF-16, the character encoding used internally on Windows machines. When I tried to open it via python, the interpreter threw an error. It turns out that Unicode is far more complex than one imagines. Although I can find ð and þ, for example, I cannot find them as the first characters of words after a newline (or even with regex \b). Another hurdle.

Overcame it! The problem was in the data. For reasons unknown to me, Microsoft Windows encodes runic characters differently than expected. So the solution was to use a text editor (BB Edit), go into the data, and replace all original thorns with regular thorns. Same for eth, asc, and so forth. Weirdly, it didn’t look like I was doing anything: both thorns looked identical on the screen.
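In Python, two things usually resolve this kind of problem: opening the file with its declared encoding, and normalizing the text so that visually identical characters also compare equal. A sketch (the file name is hypothetical, and normalization is my suggested fix, not necessarily what the BB Edit replacement did):

```python
import unicodedata

def clean(text):
    # NFC composes sequences like 'e' + combining accent into a single
    # code point, so strings that look identical also compare equal.
    return unicodedata.normalize('NFC', text)

# Reading a UTF-16 file with the right encoding declared:
# with open('bosworth_toller.txt', encoding='utf-16') as f:
#     text = clean(f.read())

assert 'e\u0301' != '\u00e9'          # look identical, compare unequal
assert clean('e\u0301') == '\u00e9'   # equal after normalization
```

This is exactly the “both thorns looked identical on the screen” trap: two different byte sequences can render as the same glyph.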

 

Screen shot of parser guts so far. Markup is data from Tichý’s Bosworth-Toller; Inflect receives Markup, then returns inflections based on the gender of strong nouns. Variants and special cases have not yet been included.

To finish STAGE ONE, I’ll now inflect every noun, pronoun, and adjective, then conjugate every verb as it comes in (on the fly). Ondrej Tichý at Charles University in Prague, who very generously sent me his structured data, took a slightly different approach: he generated all the permutations first and placed them in a list. Finally, as a sentence comes in, I’ll send each word off to the parser, receive its markup and inflections/conjugations, then search the markup for matches.
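For strong masculine a-stems, that on-the-fly inflection can be sketched as below. This is my illustration of the idea, not the actual Inflect routine, and it ignores the variants and special cases mentioned above:

```python
# Strong masculine a-stem endings (as for stan 'stone').
A_STEM_MASC = {
    'nom.sg': '', 'acc.sg': '', 'gen.sg': 'es', 'dat.sg': 'e',
    'nom.pl': 'as', 'acc.pl': 'as', 'gen.pl': 'a', 'dat.pl': 'um',
}

def inflect(stem):
    """Generate the full paradigm for a strong masculine a-stem noun."""
    return {case: stem + ending for case, ending in A_STEM_MASC.items()}
```

So inflect('stan') yields stan, stanes, stane, stanas, stana, stanum, generated on demand rather than pre-listed.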

Old English Parser 4

The next step is to isolate syntactic clusters. These clusters comprise phrases, words in a series, and so forth. An example is a prepositional phrase (PP). A PP is made up of a preposition (PRP) and an object noun. It may also contain any number of modifiers (adjectives, determiners, pronouns) and conjunctions (CNJ). So, “I saw the house of oak and stone.” The PRP is “of,” the CNJ is “and,” and both “oak” and “stone” are dative object nouns—called objects of the preposition. We read from left to right in English. So we read “of” then process the rest of the PP. A parser should do the same.

In Old English, we also read from left to right. Consequently, the position of each word in a sentence and a phrase is important. Although OE poetry allows for significant hyperbaton (scrambled syntax), it is rare for a preposition to sit after its object(s). In fact, prepositions are relatively unusual in poetry, since a dative or accusative inflection already gives that information to the reader. So the sentence will be held in an indexed array. The left-most word will be read in first and assigned the lowest number.

The first syntactic cluster I’ll attempt is the prepositional phrase. The reason to isolate this phrase first is that neither the main verb, the subject, nor the direct object of a sentence can sit inside a PP. By knocking the PP out of contention, we have a better chance of finding the subject, object, and verb.
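Given part-of-speech guesses for each word, that knock-out step can be sketched as a left-to-right scan over the indexed array. The tag names follow the abbreviations above, but the function and example are illustrative, not the parser’s actual code:

```python
def chunk_pp(tagged):
    """Group each preposition with the modifiers, nouns, and
    conjunctions that follow it into one PP chunk."""
    pps, i = [], 0
    while i < len(tagged):
        word, tag = tagged[i]
        if tag == 'PRP':
            j = i + 1
            # Absorb everything that can legally sit inside a PP.
            while j < len(tagged) and tagged[j][1] in ('ART', 'ADJ', 'N', 'CNJ'):
                j += 1
            pps.append(tagged[i:j])
            i = j
        else:
            i += 1
    return pps

sentence = [('I', 'PRO'), ('saw', 'V'), ('the', 'ART'), ('house', 'N'),
            ('of', 'PRP'), ('oak', 'N'), ('and', 'CNJ'), ('stone', 'N')]
```

Running chunk_pp on the example brackets “of oak and stone” as one PP, leaving “I saw the house” as the pool of candidates for subject, verb, and object.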

A big challenge here is how to present the results.

Here’s one option:

PP | of [PRP] oak [N{dat}] and [CNJ] stone [N{dat}]

That presentation can then be turned into html. The sentence could be printed and the PP appear in, say, green type. And when a user hovers over a word, its grammatical information could be revealed. But that’s inelegant, since the user has to interact with the data in order to make it productive. (The user already knows the sentence, so why return it again?)

Another option is to present the data in a table (which I can’t do easily in this blog).

PP | PRP | Nd | CNJ | Nd

…..| of | oak | and | stone

The cells of the table can then be colored green for a PP. In this case, the important data is returned first (the parsed results), and the sentence is returned as a reminder to the user.

Because the parser is intended to return optimal and suboptimal results, the table will have to have room for all of them. At the moment, I’m thinking of returning each complete parse as a table row, beginning with the optimal result, then proceeding downwards to the least optimal result. Here’s an example.

Wiht unhælo, grim [CNJ] ond grædig, gearo sona wæs, reoc [CNJ] ond reþe, [CNJ] ond [PRP] on ræste genam þritig þegna, þanon [ADV] eft gewat huðe hremig [PRP] to ham faran, [PRP] mid [ART] þære wælfylle wica neosan.
A sentence from Beowulf with the conjunctions (CNJ), prepositions (PRP), some adverbs (ADV), and articles/determiners (ART/DET) marked. You’ll notice that sona is an adverb, as is þanon. The old parser missed those.

Meanwhile, I have only just discovered an NLP parser from the CS Dept at UMass. It is called Factorie (Home Page). It appears to use the NLP server familiar to users of python. Although it operates on North American English, it can be trained to parse other languages. That training may be more time-intensive than writing my own parser.

Old English Parser 2

SIMPLIFICATION. It is tempting to create lists that would quickly distinguish forms. For example, OE eadig ‘blessed’ is obviously an uninflected adjective (–ig). It could be almost nothing else. But the task of discerning its word class belongs to the syntactic strata, to syntax.pl. At the most basic level, eadig could be a noun with a consonantal stem. It could be a name (> Eddie). So, at the most basic level, an NLP has to be greedy. It has to take in as many options as possible. Only then can the syntactic strata apply constraints.

A second example is swa. The word can be a conjunction ‘because, so that’ or an adverb ‘thus.’ So it has to be listed in both files—conjunctions and adverbs. The bottom strata of the parser has to return both possibilities to the calling script, ADV and CNJ.

Some words are clearly one thing or another. Gea ‘yes’ is nothing else but ‘yes’. Ne is a negative particle. At the moment, I have those words listed in the grammar script that calls out for the lists. But I will eventually put them all in their own list (e.g., uninflected.txt). Thus, for now:

if ($word =~ /^ne\b/i){push @results, "NEG UNDCL\n";}
if ($word =~ /^gea\b/i){push @results, "INTJ UNDCL\n";}
if ($word =~ /^eala\b/i){push @results, "INTJ UNDCL\n";}

These lines say: if the word matches the pattern between the forward slashes, then add the following to the buffer of results. Later, in the syntax strata, I will have to check whether there are two ne’s (ne … ne ‘neither … nor’), since that will have to be parsed as a correlative conjunction. Similarly, swaþa, and so forth. Correlative conjunctions and compound adverbs will be listed in a rule set. That is the single most challenging portion of the parser: the rule set.
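Although the parser itself is Perl, the two-ne check can be sketched like this (a Python illustration with hypothetical tag names, not the parser’s code):

```python
def mark_ne(tokens):
    """Tag 'ne': two or more in a clause become a correlative
    conjunction ('neither ... nor'); a lone 'ne' stays a negative."""
    positions = [i for i, t in enumerate(tokens) if t.lower() == 'ne']
    if len(positions) >= 2:
        return [(t, 'CNJ-CORREL') if i in positions else (t, None)
                for i, t in enumerate(tokens)]
    return [(t, 'NEG' if t.lower() == 'ne' else None) for t in tokens]
```

The rule fires only after the whole clause is in the buffer, which is why it belongs in the syntax strata rather than in the word-level lists.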

At this point, I don’t want the parser to worry about what a word is; I want it to list what it could be.

Update: Perl is such a thoughtful language that I was able to simplify the architecture further. Initially, I had created word lists. Then, I checked whether a target word was on one of the lists. If it was on the conjunction list, I returned “cnj” to the calling function. But there was no need for the extra layer. I turned the lists into if-then statements and ran them as perl scripts from grammar.pl. No reason to repeat the information: just place it once inside a script. If a word is there, return the word’s class and grammatical information. No more text files. No more filehandles. No more closed-class, open-class questions. It’s reading all closed-class words, all nominal inflections, and all adjectival inflections.

I’ve got the whole parser down to 24.5 KB! Small as an icon! But because it executes 14 scripts, it’s already slowing down. About 1.8s to execute on a single word.