The University of Massachusetts Amherst

TGrep2

In response to Lyn’s query about possible positional effects for distributive phrases, I thought I’d post a short note on how that information can be found with readily available tools. TGrep2 is a utility that lets you run regexp-like searches over a corpus parsed in Penn Treebank style, and it’s really useful for asking these sorts of questions. Doug Roland has very helpfully posted some executables of TGrep2. For people using Intel-based Macs, this is probably the simplest way to install the tool on your computer. Download the executable, name it tgrep2, make it executable with chmod, and put it in /usr/local/bin. If it’s installed correctly, you should be able to type tgrep2 at a command prompt and have it display some help.
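The installation steps might look like the following sketch. The function name install_tgrep2 and the download path in the usage comment are placeholders of mine, not anything that ships with TGrep2. If you lack write permission for /usr/local/bin, run the chmod and mv commands by hand with sudo instead.

```shell
# Hypothetical helper: make a downloaded TGrep2 binary executable and
# move it into a directory on your PATH under the name tgrep2.
install_tgrep2() {
  downloaded=$1   # path to the executable you downloaded
  target_dir=$2   # directory on your PATH, e.g. /usr/local/bin
  chmod +x "$downloaded" && mv "$downloaded" "$target_dir/tgrep2"
}

# Usage (placeholder path; adjust to wherever the file was saved):
#   install_tgrep2 ~/Downloads/tgrep2-download /usr/local/bin
#   tgrep2    # should now display some help
```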

Once it’s installed, you only need a corpus to search. There are several in the lab, including the Penn Treebank, the Brown Corpus, the Switchboard corpus, and the Penn Historical corpora; let me know if you’d like access to them in .t2c format (TGrep2 corpus format). Let’s say you’ve gotten your Wall Street Journal corpus, wsj.t2c, and you want to know how often the distributive phrases ‘neither (of) X’ and ‘both (of) X’ occur in subject position, compared to direct object position. To find out about ‘both’ in subject position, let’s search for all instances of ‘both’ that are the determiner of an NP directly dominated by an NP-SBJ node in the tree:

tgrep2 -c wsj.t2c 'NP-SBJ < (NP < (DT < (/[bB]oth/)))'

This performs a search in the corpus specified by -c wsj.t2c, and asks for all instances of ‘both’ or ‘Both’ that are directly dominated by a determiner (DT) node, which is in turn directly dominated by an NP node, which is itself directly dominated by a subject NP node (NP-SBJ). The output prints the constituents that match the search query. Another thing you might want is a count of the constituents that match the search in the corpus, which you can get by piping the output to wc (the first number it reports is the line count):

tgrep2 -c wsj.t2c 'NP-SBJ < (NP < (DT < (/[bB]oth/)))' | wc
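As a small illustration of how the counting works, here is a sketch that runs wc over a stand-in for TGrep2 output. The three bracketed trees are made up for the example and do not come from the corpus; the assumption is only that each match is printed on its own line.

```shell
# Made-up stand-in for TGrep2 output: one matched constituent per line.
hits='(NP-SBJ (NP (DT Both) (NNS sides)))
(NP-SBJ (NP (DT both) (NNS companies)))
(NP-SBJ (NP (DT Both)) (PP (IN of) (NP (PRP them))))'

# Plain wc reports lines, words, and bytes; wc -l reports only the lines.
printf '%s\n' "$hits" | wc

# tr strips the padding that some wc implementations add around the number.
count=$(printf '%s\n' "$hits" | wc -l | tr -d ' ')
echo "matches: $count"   # prints "matches: 3"
```

Applied to the real query above, the line count is the number of matching constituents.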

This tells us there are 17 lines in the output returned by the TGrep2 command, so 17 constituents that match in the corpus. It might also be useful to look at the entire sentence context of each of the hits you get:

tgrep2 -c wsj.t2c 'S << (NP-SBJ < (NP < (DT < (/[bB]oth/))))'

This returns the S node that dominates (<<, as opposed to immediately dominates, <) the desired constituent. Inspecting the output shows that there are 17 instances of ‘both’ as the determiner of the subject NP, comprising a variety of ‘both (of) X’ and extraposed ‘X both’ NPs. Now, to compare this with the number of instances in which the same constituent structure appears in direct object position, we’d look for NPs that are directly dominated by VP:

tgrep2 -c wsj.t2c 'VP < ( NP < (NP < (DT < (/[bB]oth/))))'

tgrep2 -c wsj.t2c 'VP < ( NP < (NP < (DT < (/[bB]oth/))))' | wc

This shows there are fewer instances (5) in DO position. It looks like, in this corpus, ‘both’ tends to occur in subject position. It might be useful to compare this to another quantifier, say ‘all’:

tgrep2 -c wsj.t2c 'NP-SBJ < (NP < (DT < (/[aA]ll/)))' | wc
tgrep2 -c wsj.t2c 'VP < ( NP < (NP < (DT < (/[aA]ll/))))' | wc

This reveals comparable numbers of ‘all’ in subject position (104) and DO position (125), suggesting there isn’t a general bias for quantifiers to appear in subject position. Of course, these conclusions would need to be strengthened by examining different quantifiers: for instance, another distributive quantifier, ‘each’, seems about equally likely to appear in subject (19) and DO (16) position. Likewise, it’s useful to examine different corpora to make sure the generalizations are robust; in particular, the low number of instances of ‘both’ or ‘each’ as the determiner of an argument NP (23 and 35 instances, respectively) here might lead to misleading generalizations. But the easy availability of the corpora and the ease of using TGrep2 make these questions relatively straightforward to address.
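To repeat the comparison for other quantifiers without retyping the patterns, the two queries can be generated from the word itself, and the resulting counts turned into a subject share. This is only a sketch: the helper names word_regex, subject_pattern, object_pattern, and subject_share are made up for this post, and the counts plugged in at the end are simply the ones reported above.

```shell
# Hypothetical helpers (not part of TGrep2): build the two patterns used
# above for any lowercase word, e.g. both -> /[bB]oth/.
word_regex() {
  first=$(printf '%s' "$1" | cut -c1)
  rest=$(printf '%s' "$1" | cut -c2-)
  upper=$(printf '%s' "$first" | tr '[:lower:]' '[:upper:]')
  printf '/[%s%s]%s/' "$first" "$upper" "$rest"
}
subject_pattern() { printf 'NP-SBJ < (NP < (DT < (%s)))' "$(word_regex "$1")"; }
object_pattern() { printf 'VP < (NP < (NP < (DT < (%s))))' "$(word_regex "$1")"; }

# Percentage of hits in subject position, rounded to a whole number.
# $1 = subject count, $2 = direct object count
subject_share() {
  awk -v s="$1" -v o="$2" 'BEGIN { printf "%.0f\n", 100 * s / (s + o) }'
}

# With tgrep2 and wsj.t2c at hand, a count could be collected like this:
#   tgrep2 -c wsj.t2c "$(subject_pattern each)" | wc -l

# Counts reported in this post (subject, direct object).
both_share=$(subject_share 17 5)
all_share=$(subject_share 104 125)
each_share=$(subject_share 19 16)
echo "both: ${both_share}%  all: ${all_share}%  each: ${each_share}%"
# prints "both: 77%  all: 45%  each: 54%"
```

With counts this small, the percentages for ‘both’ and ‘each’ should be read as rough indications rather than stable estimates.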
