One area of information retrieval research that has received some attention is the application of statistical term co-occurrence. For the most part, term co-occurrence has been used as a query expansion technique: a user's submitted query is expanded with ``synonyms'' which have been found to co-occur with the terms actually submitted. Overall, this technique has met with mixed results [Lesk1969,Sparck-Jones1971,Robertson et al.1981,Salton1986,Peat & Willett1991].
Another area, which has also received a good deal of attention, though only sporadically from the perspective of information retrieval, is that of lexical collocation. A lexical collocation, defined broadly, ``is an arbitrary and recurrent word combination'' [Benson1990]. In addition to being arbitrary and recurrent, lexical collocations are domain dependent and tend to form cohesive clusters [Smadja1993]. By cohesive clusters, we mean that the appearance of one (or more) part of the collocation tends to suggest the rest of the collocation, that the presence of one part offers ``evidence'' for the possible presence of the whole collocation.
In the case of lexical collocations, where the events are typically words, the stronger the association between the occurrence of two (or more) words in some syntactic relation, the higher the probability that the collocation is important as evidence. In purely statistical approaches, the ``syntactic relation'' can be relaxed to mean ``appear in linear succession in the same sentence''. In co-occurrence query expansion, by contrast, the terms are not as a rule constrained to stand in any syntactic or semantic relation (though this is clearly the intent), and hence such expansion tends to produce only marginal results. Those experiments which do show good results are those which constrain the associations more tightly [Rada & Bicknell1989].
In the present discussion, we will be dealing with collocations not in the usual sense of lexical adjacency, but in a broader sense of document features which have a measurable association.
In the following two sections, we follow closely Dunning's presentation and notation for non-parametric statistical techniques applied to word collocations [Dunning1993]. At the heart of these techniques is the observation that events which co-occur with greater frequency than can be accounted for by chance are likely to be highly associated. Given two events $A$ and $B$, if the probability that they appear together is greater than the probability that they appear independently, $P(AB) > P(A)P(B)$, then they are to some degree positively associated. If, on the other hand, $A$ and $B$ are independent, then these probabilities are equal: $P(AB) = P(A)P(B)$. The intent is not just to determine whether two events are associated, but to measure the strength of the association and thereby highlight certain highly associated pairs of events.
For our statistical analysis below, we construct a contingency table which contains the counts for each of the possible combinations of events $A$ and $B$:
\[
\begin{array}{c|cc}
        & B           & \neg B \\ \hline
A       & c(AB)       & c(A\neg B) \\
\neg A  & c(\neg AB)  & c(\neg A\neg B)
\end{array}
\]
where ``$\neg$'' denotes the absence of some event. The possible combinations are $AB$, where the events both occur; $A\neg B$, where event $A$ occurs without $B$; $\neg AB$, where $B$ occurs without $A$; and finally, $\neg A\neg B$, where neither $A$ nor $B$ occurs.
For the training phase of this research, we take event $A$ to be some lexical item (document subcomponent) identifiable in the document at hand (words or phrases taken variously from the titles, authors or abstracts), and event $B$ to be a member of a list (set) of controlled vocabulary subject headings which has previously been assigned to that document by a human indexer. Such a formulation allows us to determine to what degree the lexical components of a particular document are associated with particular members of the controlled vocabulary. However, for the deployment phase, we need to go further. First, when trying to predict which controlled vocabulary subject heading ought to be associated with a particular document, we need to reason not from a single lexical item (as might be the case with a single-word title) but from several lexical clues, drawn from the words in a title or abstract or both. Second, we do not need to determine which event pairs are or are not correlated so much as we need to rank the correlations. Fortunately, the statistical measure of correlation can be used quite naturally to deal with both of these concerns.
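The training-phase tally just described can be sketched as follows (Python is used purely for illustration; the function names and input format are our own assumptions, not part of the original system):

```python
from collections import Counter
from itertools import product

def count_events(documents):
    """Tally co-occurrence counts for every (clue, heading) pair.

    `documents` is a list of (clues, headings) pairs, where `clues` is the
    set of lexical items drawn from a document's title/abstract and
    `headings` is the set of controlled-vocabulary subject headings a human
    indexer assigned to that document.
    """
    pair_counts = Counter()   # documents containing clue A assigned heading B
    clue_counts = Counter()   # documents containing clue A
    head_counts = Counter()   # documents assigned heading B
    n_docs = 0
    for clues, headings in documents:
        n_docs += 1
        for a in clues:
            clue_counts[a] += 1
        for b in headings:
            head_counts[b] += 1
        for a, b in product(clues, headings):
            pair_counts[(a, b)] += 1
    return pair_counts, clue_counts, head_counts, n_docs

def contingency(a, b, pair_counts, clue_counts, head_counts, n_docs):
    """Recover the 2x2 contingency counts for clue A and heading B."""
    ab = pair_counts[(a, b)]
    a_not_b = clue_counts[a] - ab      # A occurs without B
    not_a_b = head_counts[b] - ab      # B occurs without A
    neither = n_docs - ab - a_not_b - not_a_b
    return ab, a_not_b, not_a_b, neither
```

A single pass over the training collection thus yields enough information to reconstruct the contingency table for any (clue, heading) pair on demand.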
One of the problems in applying some standard statistical tests, such as the $\chi^2$ and $z$-score tests, to text is the assumption that the random variables being sampled, the words of the text (or texts), conform to the distributions those tests presuppose. Such tests break down because of the large proportion of ``rare'' events, i.e. words that occur infrequently, in text. It has long been recognized in information retrieval research that the frequency of words in texts follows a Zipf curve: a word's frequency is roughly inversely proportional to its frequency rank, so that rank times frequency is approximately constant [Luhn1958,Salton1989]. This means that the so-called ``rare'' events in text are very common. For example, Salton reports a sample of running text with 50,406 unique terms of which 22,543, nearly half, occur only once [Salton1989]. The importance of such ``rare'' occurrences is greatly overestimated by tests which assume a normal distribution.
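The prevalence of rare events is easy to observe directly on any token stream. The following sketch (Python for illustration; the function name is ours) computes the fraction of vocabulary items that occur exactly once, the hapax legomena, which on Salton's figures above is $22{,}543/50{,}406 \approx 0.45$:

```python
from collections import Counter

def hapax_fraction(tokens):
    """Fraction of distinct vocabulary items occurring exactly once.

    In real corpora this fraction is typically close to one half, which is
    why tests assuming a normal distribution misjudge such "rare" events.
    """
    freqs = Counter(tokens)
    hapaxes = sum(1 for count in freqs.values() if count == 1)
    return hapaxes / len(freqs)
```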
In order to avoid this problem, we cast the training part of our experiment as a binomial event counting problem where each event $A$ being counted is compared to a particular event $B$. This allows us to measure the independence of events $A$ and $B$ from each other by comparing the distribution of event $A$ given event $B$, $p(A|B)$, to the distribution of event $A$ given the absence of $B$, $p(A|\neg B)$, where $A$ is some clue (lexical item) in the particular document at hand and $B$ is a controlled vocabulary subject heading which has previously been assigned to this document by a human indexer. We take $p(A|B) = p(A|\neg B)$ as the null hypothesis that $A$ and $B$ are independent and compare their observed distributions to determine the strength of association between them.
For this purpose we adopt as our test statistic the likelihood ratio for comparing these two binomial processes as developed by Dunning [Dunning1993]. The generalized likelihood function is written as $L(\omega; k)$, where $\omega$ represents the model parameters (hypothesis) and $k$ the observations being tested. The likelihood ratio is the maximum value of the likelihood function for the particular hypothesis being tested over the maximum value of the function over the entire parameter space:
\[
\lambda = \frac{\max_{\omega \in \Omega_0} L(\omega; k)}{\max_{\omega \in \Omega} L(\omega; k)}
\]
where $\Omega_0$ is the hypothesis being tested and $\Omega$ is the entire parameter space. In the current case, the likelihood function for a binomial process is given as
\[
L(p; k, n) = p^k (1-p)^{n-k}
\]
where $p$ is the model parameter (expected probability of a particular outcome), $k$ the number of positive observations, and $n$ the total number of observations; the binomial coefficient is omitted since it cancels in the ratio. The likelihood ratio for such a function is the maximum value of this function for the hypothesis being tested over the maximum value of the function over the entire parameter space. For two binomial processes, the function becomes
\[
L(p_1, p_2; k_1, n_1, k_2, n_2) = p_1^{k_1}(1-p_1)^{n_1-k_1}\, p_2^{k_2}(1-p_2)^{n_2-k_2}
\]
By using this function to test the null hypothesis that the parameters $p_1$ and $p_2$ are equal, we can use it in the likelihood ratio. The maximum value of this function over the entire parameter space (i.e. the two binomial processes) is attained at the observed ratio of positive observations to total observations for each process, $p_1 = k_1/n_1$ and $p_2 = k_2/n_2$. The maximum value for the particular hypothesis being tested, that $p_1 = p_2 = p$, is given as $p = (k_1+k_2)/(n_1+n_2)$. That is,
\[
\lambda = \frac{\max_{p} L(p, p; k_1, n_1, k_2, n_2)}{\max_{p_1, p_2} L(p_1, p_2; k_1, n_1, k_2, n_2)}
\]
which, after inserting the binomial functions, can be reduced to
\[
\lambda = \frac{L(p; k_1, n_1)\, L(p; k_2, n_2)}{L(p_1; k_1, n_1)\, L(p_2; k_2, n_2)}
\]
The logarithm of this ratio gives
\[
-2\log\lambda = 2\left[\log L(p_1; k_1, n_1) + \log L(p_2; k_2, n_2) - \log L(p; k_1, n_1) - \log L(p; k_2, n_2)\right]
\]
where
\[
\log L(p; k, n) = k \log p + (n - k)\log(1 - p)
\]
and as previously noted
\[
p_1 = \frac{k_1}{n_1}, \qquad p_2 = \frac{k_2}{n_2}, \qquad p = \frac{k_1 + k_2}{n_1 + n_2}.
\]
At this point, the only unknown values, $k_1$, $n_1$, $k_2$ and $n_2$, are the frequency observations taken directly from the contingency tables. $k_1$ and $n_1$ correspond to the observed distribution of $A$ given $B$: $k_1 = c(AB)$, the number of positive observations, and $n_1 = c(AB) + c(\neg AB)$, the total observations. $k_2$ and $n_2$ take their values from the distribution of $A$ given that $B$ is not present: $k_2 = c(A\neg B)$ and $n_2 = c(A\neg B) + c(\neg A\neg B)$. For any given contingency table, we can now easily calculate the $-2\log\lambda$ statistic. We interpret this statistic as a ``weight'' of association between two events and use it as a predictor of relevance between them, where the two events are a lexical clue ($A$) and a controlled vocabulary subject heading ($B$).
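The complete computation, from contingency-table counts to the $-2\log\lambda$ weight, can be sketched as follows (an illustrative Python rendering of Dunning's formula; the function names and the convention $0\log 0 = 0$ for degenerate cells are our choices):

```python
import math

def log_l(p, k, n):
    """log L(p; k, n) = k log p + (n - k) log(1 - p), taking 0*log(0) = 0."""
    def xlogy(x, y):
        return 0.0 if x == 0 else x * math.log(y)
    return xlogy(k, p) + xlogy(n - k, 1 - p)

def llr(k1, n1, k2, n2):
    """-2 log lambda for two binomial processes (Dunning 1993)."""
    p1 = k1 / n1                 # MLE for the first process
    p2 = k2 / n2                 # MLE for the second process
    p = (k1 + k2) / (n1 + n2)    # MLE under the null hypothesis p1 = p2
    return 2 * (log_l(p1, k1, n1) + log_l(p2, k2, n2)
                - log_l(p, k1, n1) - log_l(p, k2, n2))

def llr_from_table(ab, a_not_b, not_a_b, neither):
    """Map the 2x2 contingency counts to (k1, n1, k2, n2) and score."""
    k1, n1 = ab, ab + not_a_b              # distribution of A given B
    k2, n2 = a_not_b, a_not_b + neither    # distribution of A given not-B
    return llr(k1, n1, k2, n2)
```

When $A$ and $B$ are independent the observed proportions coincide and the weight is zero; the more the two conditional distributions diverge, the larger the weight.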
We also include in Section 5.1 a representative example using the $\chi^2$ test as a comparison, to demonstrate the effectiveness of this method.
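For reference, the Pearson $\chi^2$ statistic over the same 2$\times$2 contingency table can be sketched as follows (Python for illustration only; Section 5.1 may use a different variant of the test, so treat the details as assumptions):

```python
def chi_squared(ab, a_not_b, not_a_b, neither):
    """Pearson's chi-squared statistic for a 2x2 contingency table.

    Compares each observed cell count with the count expected under
    independence, (row total * column total) / grand total.
    """
    n = ab + a_not_b + not_a_b + neither
    row1, row2 = ab + a_not_b, not_a_b + neither
    col1, col2 = ab + not_a_b, a_not_b + neither
    total = 0.0
    for observed, r, c in [(ab, row1, col1), (a_not_b, row1, col2),
                           (not_a_b, row2, col1), (neither, row2, col2)]:
        expected = r * c / n
        total += (observed - expected) ** 2 / expected
    return total
```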