In this paper we briefly explored the problems of subject access and the utility of indexer consistency as a possible measure of indexing quality. Next, we presented a basic lexical collocation technique along with an effective statistical measure for identifying such collocations.
We then described a two-stage training and deployment algorithm that applies this technique in a novel way: rather than associating the usual linearly adjacent events, it selects the events to be associated from different parts of a document representation. The result is a dictionary of associations that can be used for retrieval.
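The two stages summarized above can be illustrated with a minimal sketch. It makes several assumptions that are not taken from this section: each training record is a hypothetical pair of clue words (for example, from a title or abstract) and the subject headings assigned to the same document; the association strength is scored with a log-likelihood ratio (G^2) over document co-occurrence counts, used here only as a stand-in for the statistical measure mentioned earlier; and the names `g2`, `train`, and `predict` are illustrative.

```python
import math
from collections import Counter, defaultdict


def _xlogx(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def g2(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of counts."""
    n = k11 + k12 + k21 + k22
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    return 2.0 * (cells - rows - cols + _xlogx(n))


def train(records):
    """Stage 1: build a dictionary of word -> {heading: association score}.

    `records` is an iterable of (clue_words, headings) pairs, one per
    document (an assumed input format, not the paper's).
    """
    n_docs = 0
    word_df, head_df, pair_df = Counter(), Counter(), Counter()
    for words, headings in records:
        n_docs += 1
        words, headings = set(words), set(headings)
        word_df.update(words)
        head_df.update(headings)
        # The associated events come from different parts of the record,
        # not from adjacent positions in running text.
        pair_df.update((w, h) for w in words for h in headings)

    assoc = defaultdict(dict)
    for (w, h), k11 in pair_df.items():
        k12 = word_df[w] - k11            # word without heading
        k21 = head_df[h] - k11            # heading without word
        k22 = n_docs - k11 - k12 - k21    # neither
        # Keep only pairs that co-occur more often than chance predicts.
        if k11 * n_docs > word_df[w] * head_df[h]:
            assoc[w][h] = g2(k11, k12, k21, k22)
    return assoc


def predict(assoc, clue_words, top_n=5):
    """Stage 2: rank headings by summed association with the clue words."""
    scores = Counter()
    for w in set(clue_words):
        for h, s in assoc.get(w, {}).items():
            scores[h] += s
    return [h for h, _ in scores.most_common(top_n)]
```

Summing scores across clue words is one simple way to combine evidence; other combination rules would fit the same dictionary structure.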
Finally, we applied this algorithm to a test collection in order to evaluate it against observed human performance on a similar task. Given some set of clues about a document, the algorithm predicted which controlled vocabulary subject headings ought to be assigned, and we compared these predictions to the actual assignments made by human catalogers. The evaluation showed that even in less-than-ideal circumstances such an algorithm can perform at reasonable levels of success and warrants further study.
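As a rough illustration of comparing predicted headings with cataloger assignments, the sketch below computes per-document precision and recall over sets of headings. This is an assumed scoring scheme for illustration only, not necessarily the measure used in the evaluation, and the function name `overlap_scores` is hypothetical.

```python
def overlap_scores(predicted, assigned):
    """Return (precision, recall) of predicted headings against the
    headings a human cataloger actually assigned to the document."""
    predicted, assigned = set(predicted), set(assigned)
    hits = len(predicted & assigned)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(assigned) if assigned else 0.0
    return precision, recall
```

Averaging these scores over all test documents would give a collection-level figure that can be set beside inter-indexer consistency results.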
In addition, we discussed several possible uses for such a technique and several ways in which we hope to improve it. We plan to continue along this line of research by exploring several variations on the basic algorithm. Among the paths remaining to be explored are more sophisticated natural language processing techniques for better lexical clue extraction (e.g., instead of just the words from a document subpart, we plan to try bigrams, noun phrases, etc.), the effects of stemming, domain-specific stopword lists, and other statistical methods. We also plan to apply this technique to several other purely bibliographic collections without abstracts in order to evaluate its usefulness for indexing in such situations. Finally, we think it could function effectively as an entry vocabulary module that maps foreign-language terms onto indexing terms (e.g., Library of Congress subject headings).
We would like to thank Michael Buckland, Michael Cooper, Marti Hearst, Youngin Kim and Ray Larson for their encouragement, thoughtful comments and suggestions on various drafts of this paper. The first author's research was sponsored in part by the joint NSF/NASA/ARPA Digital Libraries project, grant number IRI-9411334; the second author's research was sponsored by the ongoing Oasis project grant, HEA IID R197D40008, ``Online access in multiple database environments''.