In our 2012 SIGMORPHON paper, we propose the following measure of categorical success in MaxEnt learning with hidden structure, in this case Underlying Representations (URs) given only observed Surface Representations (SRs) (pp. 67-68):

Our objective function is stated in terms of maximizing the summed probability of all (UR, SR) pairs that have the correct SR, and an appropriate criterion is therefore to require that the summed probability over full structures be greater for the correct SR than for any other SR. We thus term this simulation successful. We further note that given a MaxEnt grammar that meets this criterion, one can make the probabilities of the correct forms arbitrarily close to 1 by scaling the weights (multiplying them by some constant).

Unfortunately, the claim in the last sentence in false, and our success criterion does not seem stringent enough, since a grammar that meets it is not necessarily correct in the sense we would like.

Here’s a simple counter-example to that claim, involving metrical structure rather than URs. We have a trisyllable that has two parsings that generate medial stress, and a single parsing that gets us each initial and final. Stress is a capital A, and footing is shown in parentheses. These probabilities come from zero weights on all constraints, except “Iamb”, which wants the foot to be right-headed, and thus penalizes candidates 2 and 3. Here Iamb has weight 0.1.

1. batAma (batA)ma 0.2624896

2. batAma ba(tAma) 0.2375104

3. bAtama (bAta)ma 0.2375104

4. batamA ba(tamA) 0.2624896

The summed probability of rows 1. and 2. is 0.50, and thus this grammar meets our definition of success if the target language has medial stress. But no matter how high we increase the weight of Iamb, we will never get that sum to exceed 0.50 (another demonstration would have been just to have the weights at zero, since scaling will have no effect, and batAma will again have 0.50 probability). A correct grammar in the sense we would like also needs to include non-zero weight on a constraint that prefers 1. over 4 (e.g. Align-Left).

So what’s the right definition? One obvious possibility would be to require a single correct candidate to have the highest probability, which corresponds to a categorical version of HG (see this paper for some discussion of the relationship between categorical HG and MaxEnt), but that seems wrong given our objective function, which doesn’t have that structure (though see my comment on this post for more on this). Another would be to require some arbitrary amount of probability on the correct form, but we could construct another counter-example simply by making the set of parses that correspond to one overt form sufficiently large w.r.t. to the others. It seems the right answer would involve knowing the conditions under which it is in fact true that scaling will bring probabilities arbitrarily close to 1, but I don’t know what they are when hidden structure is involved.

Categorical success = categorical HG?

One simple way of measuring categorical success would be to require that the learned grammar give highest probability to one fully structured candidate corresponding to the correct overt form. In the batama space, it looks like this is required in order to get a set of weights that can be scaled to give batAma probability approaching 1. For example, here are the probabilities that we get if we give Iamb and Align-L each weight 0.1.

1 batAma (batA)ma 0.2756031

2 batAma ba(tAma) 0.2256448

3 bAtama (bAta)ma 0.2493760

4 batamA ba(tamA) 0.2493760

As we scale these weights, we’ll put more and more probability on candidate 1. I believe it’s impossible to find weights that when scaled will make the summed probability of 1 and 2 approach 1, without one of those candidates being preferred over the other (because we need to sufficiently penalize either left or right alignment of the foot, and either a left- or right-headed foot, to eliminate 3. and 4.)

This is a simple measure of correctness, but are there cases of “success” that it would rule out? That is, in which weights can be scaled to make the summed probability of two or more candidates corresponding to a single overt form approach 1, but neither one is preferred over the other? I suspect there are, but I haven’t come up with one yet.