Update 5/2017. Now that the static state parser is working, time to move on to the syntax. The next big issue is language. So far, I’ve written in Perl. Love it. Works brilliantly. But much of the available open-source code is in Python. NLTK is in Python. You can tokenize with it. Parse with it. The Helsinki parsed OE corpus is in it. And it loads into a script easy-peasy. Very tempting to switch horses in midstream.
Second issue: I’m moving the parser from Unix to Linux.
I’m putting the parser on a sandboxed Dell that’s about 10 years old and about $150 on Amazon. Fine computer that still runs Ubuntu! As an aside, I can configure iptables to reject traffic from China, which cuts back significantly on the processor time lost to robot hack attempts every 2 seconds. Here hacker “tomcat” is calling from 188.8.131.52, which is Chinanet, No.31, Jingrong Street, Beijing.
Two seconds later, the next caller, 184.108.40.206 at port 36116:11, is from the same server. Between this sort of nonsense and the spam, it’s a wonder academic computing can do anything else.
Now that the basic parser is working, it needs fine-tuning. First, the lists of closed-class words need to be improved. Pronouns are especially difficult to differentiate, and I wonder if I need to. Relative, interrogative, demonstrative—are these required to assemble phrases?
Second, the list of prepositions (PRP) is too short. Mitchell & Robinson have a good list in their textbook.
Third, the order of assembly for the syntax portion needs some thought. Obviously, closed-class words must come first. After pronouns, prepositional phrases might be easiest to isolate. The trick is to try to think like an undergrad: what were the first steps? Small closed-class words (PRN, CNJ, PRP), then obvious inflections (-um, -ode, -ost, etc.).
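To make that order concrete, here is a minimal sketch of the undergrad-style first pass, written in Python rather than the parser's Perl. Everything in it is an assumption for illustration: the word lists are tiny stand-ins, the suffix readings are only "possible" guesses, and the names `CLOSED_CLASS`, `SUFFIX_HINTS`, and `first_pass` are hypothetical, not taken from the actual parser.

```python
# Hypothetical, deliberately tiny word lists; a real parser needs far fuller ones.
CLOSED_CLASS = {
    "PRN": {"ic", "þu", "he", "heo", "hit", "we", "ge", "hie"},
    "CNJ": {"and", "ac", "oþþe", "gif"},
    "PRP": {"on", "to", "mid", "fram", "of", "æt"},
}

# The suffixes named above, each with one illustrative (not exhaustive) reading.
SUFFIX_HINTS = [
    ("ode", "possible weak verb, past singular"),
    ("ost", "possible superlative"),
    ("um", "possible dative plural"),
]


def first_pass(token):
    """Return candidate tags: closed-class lookup first, then suffix guesses."""
    word = token.lower()

    # Step 1: small closed-class words are settled by simple lookup.
    for tag, words in CLOSED_CLASS.items():
        if word in words:
            return [tag]

    # Step 2: otherwise collect every reading suggested by an obvious ending.
    # The length check keeps a bare suffix from matching itself.
    hints = [
        reading
        for suffix, reading in SUFFIX_HINTS
        if word.endswith(suffix) and len(word) > len(suffix)
    ]

    # Step 3: no evidence either way; leave the word for a later stage.
    return hints or ["?"]


if __name__ == "__main__":
    for w in ["mid", "lufode", "cyningum", "heardost", "hus"]:
        print(w, first_pass(w))
```

Returning a list rather than a single tag keeps the ambiguity visible, so a later syntactic stage can choose among the candidates.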
Update 6/20/2017. It’s working. The parser finds possible inflections and declensions based on suffixes, then checks for closed-class words, then checks for prepositional phrases. Still a long way to go. Noun phrases, the right boundary of PP, suboptimal parses, and so on. But it’s working. Here it is: http://www.bede.net/misc/dublin/parse.html
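For readers wondering why the right boundary of a PP is hard, here is a rough Python sketch of one crude way to guess it. It is not the method the parser uses. It assumes a phrase starts at a preposition and runs until the next closed-class word or an arbitrary length cap. The word lists, the cap, and the name `find_pps` are all hypothetical.

```python
# Hypothetical stand-in lists; not the parser's actual data.
PREPOSITIONS = {"on", "to", "mid", "fram", "of", "æt"}
BOUNDARY_WORDS = PREPOSITIONS | {"and", "ac", "oþþe", "gif", "ic", "he", "heo", "hit", "we"}
MAX_OBJECT_WORDS = 3  # arbitrary cap on words following the preposition


def find_pps(tokens):
    """Return (start, end) index pairs, end exclusive, for guessed PPs."""
    spans = []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in PREPOSITIONS:
            j = i + 1
            # Extend rightward until a boundary word, the cap, or the end.
            while (
                j < len(tokens)
                and j - i <= MAX_OBJECT_WORDS
                and tokens[j].lower() not in BOUNDARY_WORDS
            ):
                j += 1
            # Keep the span only if the preposition has at least one object word.
            if j > i + 1:
                spans.append((i, j))
            i = j
        else:
            i += 1
    return spans


if __name__ == "__main__":
    sentence = "he com mid his freondum to þære byrig and spræc".split()
    for start, end in find_pps(sentence):
        print(" ".join(sentence[start:end]))
```

On the sample sentence this picks out "mid his freondum" and "to þære byrig". The heuristic only works there because another closed-class word happens to follow each phrase. When a verb or noun follows instead, the cap is all that stops the span, which is exactly the boundary problem still to be solved.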