About three-quarters there

[Screenshot, 12/2/2018]

You’re only as good as your data.

That is the lesson here. Single brackets [x] indicate an entry in Ondrej Tichy’s Bosworth-Toller, which I edited into a JSON file. Double brackets [[x]] indicate an entry pulled from the raw data of Ondrej’s BT when the word isn’t found in the JSON file. Empty brackets indicate no returned value. A word like mæg can mean ‘may’ (V) or ‘kin’ (N). Mæg didn’t make it into the structured data, and the raw data mischaracterized its verbal form, so the parser didn’t pick up the verb.
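Roughly, the lookup cascade works like this (a minimal sketch; the function name, data structures, and entries below are illustrative, not the actual code):

```python
# A minimal sketch of the lookup cascade; names and entries are invented.
def lookup(word, structured, raw):
    """Return [entry] from the JSON data, [[entry]] from the raw BT
    data, or [] when neither source knows the word."""
    if word in structured:
        return f"[{structured[word]}]"
    if word in raw:
        return f"[[{raw[word]}]]"
    return "[]"

structured = {"cyning": "king (N)"}   # hypothetical JSON-derived entries
raw = {"mæg": "kin (N)"}              # raw BT text; the verbal sense is mislabeled

print(lookup("cyning", structured, raw))  # [king (N)]
print(lookup("mæg", structured, raw))     # [[kin (N)]] -- the verb 'may' is lost
print(lookup("hwæt", structured, raw))    # []
```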

Rather than spend days improving the data from Bosworth-Toller, or overwhelm the servers in Prague with BeautifulSoup requests, I’m going to scrape word lists from Old English sites and OCR some glossaries from freely available books. If I can compile 10 or 20 word lists and zip them to grammatical information, I can get a percentage likelihood for any given word. I can also use the York-Helsinki Parsed Corpus of Aelfric’s prose through CLTK. It won’t catch all of the words, but it might help.
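To make the zipping concrete, here is a rough sketch of how merged lists could yield per-word likelihoods; the lists and tags are invented for illustration:

```python
from collections import Counter, defaultdict

# Several scraped word lists, each mapping a word to a part-of-speech tag.
wordlists = [
    {"mæg": "V", "cyning": "N"},   # e.g. a glossary from one site
    {"mæg": "N", "cyning": "N"},   # e.g. an OCR'd glossary that favors the noun
    {"mæg": "V"},
]

# Tally how often each tag is assigned to each word across the lists.
tallies = defaultdict(Counter)
for wl in wordlists:
    for word, pos in wl.items():
        tallies[word][pos] += 1

def likelihoods(word):
    """Percentage likelihood of each tag, based on how often the lists agree."""
    counts = tallies[word]
    total = sum(counts.values())
    return {pos: round(100 * n / total) for pos, n in counts.items()}

print(likelihoods("mæg"))     # {'V': 67, 'N': 33}
print(likelihoods("cyning"))  # {'N': 100}
```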

I’ve written a simple script to inflect any noun or adjective and to conjugate any verb. I can work it backwards to find the root form of a word, then send that to BT.
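Working the script backwards might look something like the following: generate each paradigm, then invert it into a form-to-root table. The endings here are a simplified strong masculine a-stem paradigm, and the function and roots are illustrative:

```python
# Generate a (heavily simplified) noun paradigm for a given root.
def inflect_noun(root):
    return {root: "nom.sg", root + "es": "gen.sg",
            root + "e": "dat.sg", root + "as": "nom.pl"}

# Invert the paradigms into a form -> (root, case) table.
roots = ["cyning", "stan"]
form_to_root = {}
for root in roots:
    for form, case in inflect_noun(root).items():
        form_to_root[form] = (root, case)

# An inflected form found in a text now reduces to its root form,
# which is what would be sent to BT.
print(form_to_root.get("cyninges"))  # ('cyning', 'gen.sg')
```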

The final step is to run the words and forms through a syntactic parser. If it sees ne, which carries a weight of 5, it increases the likelihood that the next word is a verb, since negative particles almost always sit next to verbs in OE. (One can check that with a bigram search.) Similar proximity searches against prepositions, pronouns, and so forth help to assess the weights (probabilities).
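In outline, that adjustment could look like this sketch; the tag names, score values, and function are placeholders around the one weight named above (ne = 5):

```python
# If the previous token is the negative particle 'ne', boost the verb
# score of the next word by its weight. Other scores are placeholders.
PARTICLE_WEIGHTS = {"ne": 5}

def reweight(tokens, scores):
    """scores: one {tag: weight} dict per token; returns a copy with the
    verb weight boosted for any word that follows a negative particle."""
    adjusted = [dict(s) for s in scores]
    for i, tok in enumerate(tokens[:-1]):
        boost = PARTICLE_WEIGHTS.get(tok.lower(), 0)
        if boost:
            adjusted[i + 1]["V"] = adjusted[i + 1].get("V", 0) + boost
    return adjusted

tokens = ["he", "ne", "mæg"]
scores = [{"PRON": 9}, {"PART": 9}, {"V": 3, "N": 3}]  # 'mæg' is ambiguous
print(reweight(tokens, scores)[2])  # {'V': 8, 'N': 3} -- verb now favored
```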

Once this next layer is completed and the weights are adjusted, I will have a decent control against which to check the more experimental parser.
