June 28, 1998, Revised July 14, 1998
Acknowledgement: The work reported here was supported by the Defense Advanced Research Projects Agency through DARPA Contract N66001-97-C-8541; AO# F477: Search Support for Unfamiliar Metadata Vocabularies.
Entry Vocabulary Modules (EVMs) are based on the frequency of occurrence of single terms. We intend to experiment with the effects of drawing on natural language processing (NLP) techniques and, where applicable, the syntactic relationships frequently used in classification and categorization schemes. We discuss the integration of natural language processing techniques into the process of building EVMs, which is described in more detail elsewhere (Norgard98).
We use one general stop word list, but we anticipate the need for domain-specific stop word lists.
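One way to layer a domain-specific stop word list on top of the general one is sketched below. This is a minimal illustration only: the word lists, the `biomedical` domain key, and the function name `remove_stop_words` are hypothetical and are not taken from the report.

```python
# Hypothetical word lists for illustration; the report does not publish its stop words.
GENERAL_STOP_WORDS = frozenset({"the", "of", "and", "a", "in", "to", "is"})
DOMAIN_STOP_WORDS = {
    "biomedical": frozenset({"patient", "study", "results"}),
}


def remove_stop_words(tokens, domain=None):
    """Drop general stop words, plus domain-specific ones when a domain is given."""
    stop = GENERAL_STOP_WORDS | DOMAIN_STOP_WORDS.get(domain, frozenset())
    return [t for t in tokens if t.lower() not in stop]


tokens = ["The", "results", "of", "dosimetry"]
print(remove_stop_words(tokens))                       # general list only
print(remove_stop_words(tokens, domain="biomedical"))  # general + domain list
```

Keeping the domain lists separate from the general list means the same text can be filtered differently for each target vocabulary without editing the shared list.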
The Brill Tagger and the Apple Pie Parser were both trained on data from the Penn Treebank, a syntactically tagged corpus which includes the Wall Street Journal (WSJ) Penn Treebank Corpus and the Penn Treebank Brown Corpus. The major difference between the two is that the Apple Pie Parser is a stochastic parser and the Brill Tagger relies on linguistic rules. We are interested in the effects of this difference, if any, on the quality and effectiveness of our association dictionaries.
The Apple Pie Parser
The Apple Pie Parser (APP) was written by Satoshi Sekine and Ralph Grishman as part of the Proteus Project at New York University. The version of the Apple Pie Parser we use (APP Version 5.9) was released on April 4, 1997. It runs on Solaris, Linux, and Windows NT.
The APP is a bottom-up probabilistic chart parser that uses a best-first search algorithm to find the parse tree with the best score. It generates a syntactic tree with bracketing. The APP tries to make a parse tree as accurate as possible for reasonably well-formed sentences (e.g., sentences in newspapers or well-written documents). It is not designed to parse the many ill-formed, but still seemingly reasonable, sentences found, for example, in typical conversation.
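To illustrate how noun phrases might be pulled out of a bracketed parse tree, here is a minimal sketch. It assumes Penn Treebank-style bracketing in which noun phrase nodes carry a label beginning with `NP`; the exact output format of the APP may differ, and the function names and the sample sentence are hypothetical.

```python
import re


def parse_brackets(text):
    """Parse a bracketed tree into nested lists: [label, child, ...]; leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok              # a leaf word
        node = [tokens[pos]]        # the first token after "(" is the node label
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                    # consume the closing ")"
        return node

    return read()


def leaves(node):
    """Return the words under a node, in sentence order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def noun_phrases(node, label_prefix="NP"):
    """Collect the text under every node whose label starts with label_prefix."""
    if isinstance(node, str):
        return []
    found = []
    if node[0].startswith(label_prefix):
        found.append(" ".join(leaves(node)))
    for child in node[1:]:
        found.extend(noun_phrases(child, label_prefix))
    return found


# Hypothetical parser output for a short sentence.
tree = parse_brackets(
    "(S (NP (JJ medical) (NN technology)) "
    "(VP (VBZ improves) (NP (NN health) (NN care))))"
)
print(noun_phrases(tree))  # ['medical technology', 'health care']
```

Note that nested noun phrases are reported once for each enclosing NP node, so a real extraction step would need to decide whether to keep only the innermost or the outermost phrase.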
The Brill Tagger
The Brill Tagger is a transformation-based part-of-speech tagger written by Eric Brill. The tagger is widely used; for example, it was applied in experiments investigating how WordNet can be used with contextual clues to disambiguate terms (Leacock98).
According to Brill, the tagger achieves higher rates of tagging accuracy than traditional stochastic taggers. The Brill Tagger collects linguistic information using fewer than 200 simple rules. The tagger is composed of a morphological analyzer-generator and a parser-generator.
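The general idea of transformation-based tagging can be sketched as follows: each word first receives its most likely tag, and an ordered list of contextual rules then rewrites tags that look wrong in context. This is a toy illustration, not the Brill Tagger's code; the lexicon, the single rule, and the function names are hypothetical, and the real tagger learns its rules from a tagged corpus.

```python
def initial_tags(words, lexicon, default="NN"):
    """Assign each word its most likely tag from a lexicon (default tag for unknown words)."""
    return [lexicon.get(w.lower(), default) for w in words]


def apply_rules(tags, rules):
    """Apply each (from_tag, to_tag, prev_tag) rule, in order, across the sentence."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags


# Toy lexicon (assumed): "study" is listed with its noun reading.
LEXICON = {"we": "PRP", "want": "VBP", "to": "TO",
           "study": "NN", "the": "DT", "parser": "NN"}

# One contextual rule: change NN to VB when the previous tag is TO.
RULES = [("NN", "VB", "TO")]

words = "We want to study the parser".split()
tags = apply_rules(initial_tags(words, LEXICON), RULES)
print(list(zip(words, tags)))
```

Here the initial tag for "study" is NN, and the rule corrects it to VB because it follows "to".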
Attractive features of the Brill Tagger include its modularity and extensibility. All components are available independently. The data can be retrained on any corpus. The program and its data are both easily extensible. Available for research and commercial use, the Brill Tagger is implemented in C, with some utility programs in Perl.
How these differences would affect information retrieval results is a separate question, to be considered later.
The APP identified 204,493 noun phrases, while the Brill Tagger identified only 183,085, a difference of 21,408. This could simply mean that the APP incorrectly identified many more phrases, or that the Brill Tagger failed to identify a large number of noun phrases that it should have found.
The other interesting result of this test is that the parser and the tagger identified only 96,483 noun phrases in common. We found that 47% of the APP's noun phrases were also identified by the Brill Tagger, and 53% of the Brill Tagger's noun phrases were also identified by the APP.
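The overlap percentages follow directly from the three counts reported above. The sketch below shows one way such a comparison could be computed; it assumes the counts refer to distinct noun phrases (hence the use of sets), and the function name `overlap_report` is hypothetical.

```python
def overlap_report(phrases_a, phrases_b):
    """Compare two collections of noun phrases, treating each as a set of distinct phrases."""
    a, b = set(phrases_a), set(phrases_b)
    common = a & b
    return {
        "common": len(common),
        "share_of_a": len(common) / len(a) if a else 0.0,
        "share_of_b": len(common) / len(b) if b else 0.0,
    }


# Small illustrative input (not data from the report).
print(overlap_report(["water", "water molecules", "system"],
                     ["water", "system", "medical images", "dosimetry"]))

# Checking the reported figures from the counts given in the text.
app_total, brill_total, in_common = 204_493, 183_085, 96_483
print(app_total - brill_total)            # 21408
print(f"{in_common / app_total:.1%}")     # 47.2%
print(f"{in_common / brill_total:.1%}")   # 52.7%
```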
Partial match query: information
Both dictionaries were queried with the search term ``information''. These were partial match searches.
The table in Figure 1 shows that all the ordinary language terms returned (``information'', ``information systems'', and ``information technology'') are shared by both the Apple Pie Parser association dictionary and the Brill Tagger association dictionary. Terms in common are italicized in these tables. For example, in the APP association dictionary the controlled vocabulary term most closely associated with ``information'' was ``Information science'', followed by ``Medical information systems''; the third-ranked entry paired the noun phrase ``information systems'' with ``Medical information systems''. The Brill Tagger dictionary ranked ``Medical information systems'' first, ``Cellular effects of radiation'' second, and ``Health care'' third. The rankings differ, however, apparently reflecting differences in frequency counts that result from the different parsing output of each technique. Five controlled vocabulary terms appear in the top ten for both techniques: ``Information science'', ``Medical information systems'', ``Health care'', ``Information technology'', and ``Cellular effects of radiation''.
Apple Pie Parser Association Dictionary Results
Rank | Noun phrase | Search term |
78.8814 | information | Information science |
78.1168 | information | Medical information systems |
77.3415 | information systems | Medical information systems |
73.1386 | information | Health care |
70.6619 | information technology | Information technology |
62.2819 | information | Copyright |
61.9060 | information systems | Health care |
56.8259 | information technology | Health care |
55.8157 | information | Cellular effects of radiation |
53.3040 | information systems | Data privacy |
Brill Tagger Association Dictionary Results
Rank | Noun phrase | Search term |
126.8530 | information systems | Medical information systems |
98.7863 | information | Cellular effects of radiation |
94.9497 | information systems | Health care |
74.7761 | information | Biomechanics |
73.1447 | information | Radiation therapy |
72.1698 | information | Information science |
71.8793 | information | Biothermics |
71.8011 | information technology | Information technology |
69.9736 | information | Medical information systems |
64.7999 | information | Bioelectric phenomena |
Exact match query: system
Exact match here means that only ordinary language terms whose text is identical to the query are matched. That is, the search result shows the controlled vocabulary terms associated with the ordinary language terms that are exactly the same as the query. As the table in Figure 2 demonstrates, an exact match query with the term ``system'', arguably a less specific term, returned very similar results from both association dictionaries. The APP dictionary placed the controlled vocabulary term ``Medical expert systems'' in its top ten, while the Brill Tagger dictionary did not. Conversely, the Brill Tagger dictionary identified ``Dosimetry'' as a highly associated controlled vocabulary term, while the APP dictionary did not. Otherwise the overlap and ranking order are strikingly similar.
Apple Pie Parser Association Dictionary Results
Rank | Noun phrase | Search term |
248.2393 | system | Molecular biophysics |
195.0609 | system | Proteins |
175.4192 | system | Cellular effects of radiation |
109.9675 | system | Biomembrane transport |
94.6629 | system | Bioelectric phenomena |
88.8001 | system | Biothermics |
82.2171 | system | Physiological models |
76.5039 | system | Biological effects of gamma-rays |
73.6449 | system | Medical expert systems |
66.3451 | system | Muscle |
Brill Tagger Association Dictionary Results
Rank | Noun phrase | Search term |
410.0644 | system | Molecular biophysics |
302.3192 | system | Cellular effects of radiation |
278.1254 | system | Proteins |
163.3447 | system | Biomembrane transport |
155.6857 | system | Bioelectric phenomena |
137.2306 | system | Biothermics |
130.0540 | system | Physiological models |
119.8809 | system | Muscle |
118.6029 | system | Dosimetry |
111.0834 | system | Biological effects of gamma-rays |
Partial match query: water
Partial match means a truncated search. That is, the search result includes the controlled vocabulary terms associated with all the ordinary language terms that begin with the same text as the query. Figure 3 shows the results of a partial match query for ``water'' against the association dictionaries. Both identified the ordinary language terms ``water'' and ``water molecules'' as the most frequently occurring terms containing ``water''. The APP did not identify ``Solvation'' or ``Molecular biophysics'' as potentially useful controlled vocabulary terms, whereas the Brill Tagger did. The Brill Tagger may have done better than the APP at identifying ``water molecules'' as a noun phrase (five instances in the top ten versus three).
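The difference between exact and partial (truncated) matching can be sketched with a toy association dictionary. This is an illustration under stated assumptions, not the actual query code: the dictionary layout, the case-insensitive comparison, and the function name `query` are hypothetical, and the four weights are copied from Brill Tagger rows of Figure 3.

```python
# Toy association dictionary: (noun phrase, controlled vocabulary term) -> weight.
ASSOCIATIONS = {
    ("water", "Water"): 218.6793,
    ("water", "Biomechanics"): 206.5127,
    ("water molecules", "Molecular dynamics method"): 118.1383,
    ("water molecules", "Solvation"): 98.4870,
}


def query(associations, term, partial=False):
    """Return (weight, noun phrase, controlled term) tuples, highest weight first.

    Exact match: the noun phrase must equal the query.
    Partial match: the noun phrase must begin with the query (truncated search).
    """
    term = term.lower()

    def matches(phrase):
        phrase = phrase.lower()
        return phrase.startswith(term) if partial else phrase == term

    hits = [(weight, phrase, cv_term)
            for (phrase, cv_term), weight in associations.items()
            if matches(phrase)]
    return sorted(hits, reverse=True)


for row in query(ASSOCIATIONS, "water"):                # exact: "water" only
    print(row)
for row in query(ASSOCIATIONS, "water", partial=True):  # also "water molecules"
    print(row)
```

Under this definition the exact match results are always a subset of the partial match results for the same query.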
Apple Pie Parser Association Dictionary Results
Rank | Noun phrase | Search term |
227.8642 | water | Water |
139.7056 | water | Biomechanics |
71.3624 | water molecules | Hydrogen bonds |
66.1800 | water molecules | Water |
62.5399 | water | Cardiology |
61.9131 | water | Physiological models |
61.4642 | water molecules | Molecular dynamics method |
61.3020 | water | Cellular effects of radiation |
53.7821 | water | Neurophysiology |
51.8685 | water | Molecular dynamics method |
Brill Tagger Association Dictionary Results
Rank | Noun phrase | Search term |
218.6793 | water | Water |
206.5127 | water | Biomechanics |
118.1383 | water molecules | Molecular dynamics method |
106.7155 | water | Cellular effects of radiation |
104.9031 | water | Physiological models |
98.4870 | water molecules | Solvation |
92.9816 | water | Cardiology |
92.9689 | water molecules | Water |
91.9367 | water molecules | Molecular biophysics |
82.8147 | water molecules | Biomechanics |
Exact match query: water
Figure 4 shows the results of an exact match query with ``water''. The results of exact matching on the ordinary language term ``water'' were very similar for both dictionaries. The top six terms were the same, with only minor differences in ranking. Overall, the two dictionaries share nine of the top ten terms.
Apple Pie Parser Association Dictionary Results
Rank | Noun phrase | Search term |
227.8642 | water | Water |
139.7056 | water | Biomechanics |
62.5399 | water | Cardiology |
61.9131 | water | Physiological models |
61.3020 | water | Cellular effects of radiation |
53.7821 | water | Neurophysiology |
51.8685 | water | Molecular dynamics method |
47.3778 | water | Biodiffusion |
44.4720 | water | Muscle |
43.1905 | water | Haemodynamics |
Brill Tagger Association Dictionary Results
Rank | Noun phrase | Search term |
218.6793 | water | Water |
206.5127 | water | Biomechanics |
106.7155 | water | Cellular effects of radiation |
104.9031 | water | Physiological models |
92.9816 | water | Cardiology |
82.2806 | water | Neurophysiology |
73.7564 | water | Muscle |
70.1808 | water | Medical signal processing |
68.2778 | water | Molecular dynamics method |
67.9329 | water | Haemodynamics |
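The overlap reported for the exact match ``water'' query can be checked directly from the two tables in Figure 4. The following is a minimal sketch rather than the procedure used in the study; the variable names are hypothetical, and the lists are transcribed in rank order from the tables above.

```python
# Controlled vocabulary terms from Figure 4, in rank order.
APP_TOP10 = [
    "Water", "Biomechanics", "Cardiology", "Physiological models",
    "Cellular effects of radiation", "Neurophysiology",
    "Molecular dynamics method", "Biodiffusion", "Muscle", "Haemodynamics",
]
BRILL_TOP10 = [
    "Water", "Biomechanics", "Cellular effects of radiation",
    "Physiological models", "Cardiology", "Neurophysiology", "Muscle",
    "Medical signal processing", "Molecular dynamics method", "Haemodynamics",
]

shared = [t for t in APP_TOP10 if t in BRILL_TOP10]
only_app = [t for t in APP_TOP10 if t not in BRILL_TOP10]
only_brill = [t for t in BRILL_TOP10 if t not in APP_TOP10]

print(len(shared), "shared terms")
print("APP only:", only_app)
print("Brill only:", only_brill)
print("Same top six:", set(APP_TOP10[:6]) == set(BRILL_TOP10[:6]))

# Rank shift for each shared term (positive: ranked lower by the Brill Tagger).
for term in shared:
    shift = BRILL_TOP10.index(term) - APP_TOP10.index(term)
    print(f"{term}: {shift:+d}")
```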
Partial match query: medical
We searched with the query ``medical'' for partial matches to see what kind of noun phrases (A-terms) each technique identified. The results shown in Figure 5 for ordinary language noun phrases were very similar in this case. Both techniques identified ``medical images'' as the most frequently occurring noun phrase. They both also tagged ``medical technology'', ``medical expert systems'', and ``medical knowledge'' as frequent noun phrases. The Brill Tagger did not rank ``medical electronics'' among the top ten, favoring ``medical logic modules'' instead. The APP appears to have failed to identify ``medical logic modules'' as a noun phrase altogether.
Apple Pie Parser Association Dictionary Results
Rank | Noun phrase | Search term |
93.5097 | medical images | Medical image processing |
78.8728 | medical knowledge | Decision support systems |
60.1598 | medical expert systems | Medical expert systems |
59.2303 | medical technology | Biomedical engineering |
56.9726 | medical electronics | Societies |
56.8708 | medical technology | Economics |
56.5006 | medical technology | Reviews |
55.9864 | medical expert systems | Algebra |
53.1609 | medical expert system | Medical expert systems |
50.7716 | medical knowledge | Information retrieval systems |
Brill Tagger Association Dictionary Results
Rank | Noun phrase | Search term |
119.9875 | medical images | Medical image processing |
109.9154 | medical logic modules | Logic programming |
83.3679 | medical logic modules | Subroutines |
73.6530 | medical knowledge | Decision support systems |
66.9142 | medical logic modules | Program compilers |
66.0145 | medical expert systems | Medical expert systems |
63.1926 | medical imaging | Medical image processing |
62.6058 | medical logic modules | Knowledge representation |
61.0086 | medical technology | Biomedical engineering |
59.8106 | medical technology | Economics |
Partial match query: medical technology
We searched on the noun phrase ``medical technology'' for partial matches. Figure 6 shows that the top five were exactly the same for both techniques. The Brill Tagger did not identify ``medical technology and advocacy'' as a noun phrase, whereas the APP did. However, it is debatable whether this should be considered a ``good'' noun phrase. Would someone ever search on such a phrase?
The APP results for this query include the incorrectly identified noun phrase ``medical technology cannot''. This may be a problem with our approach to extracting tagged data, or it may reflect the quality of the parsing done by the APP. Further investigation is needed to determine the cause of this kind of problem.
Apple Pie Parser Association Dictionary Results
Rank | Noun phrase | Search term |
59.2303 | medical technology | Biomedical engineering |
56.8708 | medical technology | Economics |
56.5006 | medical technology | Reviews |
36.2614 | medical technology | Health care |
12.2693 | medical technology profession | Economics |
12.2693 | medical technology cannot | Economics |
11.4246 | medical technology and advocacy | History |
11.3357 | medical technology management problems | Planning |
10.3361 | medical technology and advocacy | Biomedical engineering |
10.0648 | medical technology management problems | Biomedical education |
Brill Tagger Association Dictionary Results
Rank | Noun phrase | Search term |
61.0086 | medical technology | Biomedical engineering |
59.8106 | medical technology | Economics |
42.6724 | medical technology | Reviews |
37.3780 | medical technology | Health care |
12.7788 | medical technology profession | Economics |
11.4749 | medical technology management problems | Planning |
10.5132 | medical technology management programmes | Biomedical engineering |
10.0125 | medical technology management problems | Biomedical education |
8.7924 | medical technology profession | Biomedical engineering |
6.6519 | medical technology management problems | Health care |
[Norgard 1998] Norgard, B. (1998). Entry Vocabulary Modules and Agents. Technical report.
[Plaunt 1998] Plaunt, C. and B. A. Norgard (1998). An association based method for automatic indexing with a controlled vocabulary. Journal of the American Society for Information Science.