Adding Natural Language Processing Techniques to the Entry Vocabulary Module Building Process

Youngin Kim and Barbara Norgard

June 28, 1998, Revised July 14, 1998

Abstract: In this paper we report the incorporation of NLP techniques into the process of building noun-phrase-based association dictionaries for vocabulary mapping. We discuss why NLP techniques are needed, which NLP techniques are used, and how they are used, and we show preliminary results of using association dictionaries based on these NLP techniques. Preliminary tests indicate that the two specific NLP tools we used (the Apple Pie Parser and the Brill Tagger) perform adequately in identifying noun phrases, but exhibit some differences.

Acknowledgement: The work reported here was supported by Defense Advanced Research Projects Agency through DARPA Contract N66001-97-C-8541; AO# F477: Search Support for Unfamiliar Metadata Vocabularies.


As part of the project ``Search Support for Unfamiliar Metadata Vocabularies'', this report addresses the addition of natural language processing techniques to enhance statistical term co-occurrence approaches in creating Entry Vocabulary Modules (Task D, Year one).

Entry Vocabulary Modules (EVMs) are based on single-term occurrence frequencies. We intend to experiment with the effects of drawing on natural language processing (NLP) techniques and, where applicable, the syntactical relationships frequently used in classification and categorization schemes. We discuss the integration of NLP techniques into the process of building EVMs, which is described in more detail elsewhere (Norgard98).

Why use NLP techniques?

We started out by making word-based association dictionaries. However, this approach fails to take into account the frequent use of phrases, particularly noun phrases. Discourse in any field draws on the richness of its linguistic resources to represent concepts and their relationships. Noun phrases are more likely than single words to be used to represent complex concepts. We are interested in minimal noun phrase units, such as ``private women's colleges'', rather than more complex noun phrases like ``the number of students who attended private women's colleges in Vermont during the 1970s''. Therefore, we use a tagger and a parser to identify simple noun phrases in our data. By creating both word-based and noun-phrase-based association dictionaries for each data set, we can compare the quality of the mapping each approach offers.

Which NLP techniques to use?

Nouns and noun phrases are called ``substantive words'' in the field of Computational Linguistics and ``content words'' in Information Science. Due to the importance of noun phrases for indexing and searching, we are primarily interested in noun phrase identification and extraction techniques. For this reason, we use a parser and a tagger to label the results. We also use stop word lists oriented towards filtering out content-free words and noun phrases that have little utility in searching (e.g., ``the'', ``and'', ``the people'', and ``it'').

We use one general stop word list, but we anticipate the need for domain-specific stop word lists.

How do we use NLP techniques?

We build two kinds of association dictionaries: a word-based dictionary and a phrase-based dictionary. These are treated differently. To prepare text for word-based association dictionaries, punctuation and stop words are removed. For phrase-based association dictionaries, on the other hand, the text must be preprocessed for parsing and part-of-speech (POS) tagging. Punctuation and stop words are retained as aids to parsing and POS-tagging. We are interested in noun phrases, or content words, which can be extracted from tagger and parser output. Stop words still need to be removed, but only after parsing and POS-tagging are completed.
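The word-based preparation step can be sketched as follows. This is a minimal illustration, not the project's actual code; the function name and the tiny stop word list are hypothetical stand-ins.

```python
import re

# Illustrative stand-in for the general stop word list described above.
STOP_WORDS = {"the", "and", "of", "in", "it", "a", "an", "for"}

def prepare_for_word_dictionary(text):
    """Lowercase the text, strip punctuation, and drop stop words,
    leaving the tokens used for the word-based association dictionary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(prepare_for_word_dictionary("The retrieval of medical information."))
# ['retrieval', 'medical', 'information']
```

For the phrase-based dictionary, by contrast, the raw text (punctuation and stop words intact) is passed to the parser or tagger first.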

Preparing data for parsing and tagging

The parser and tagger we use at this time perform best when they are passed one sentence at a time. A title usually consists of one or two complete statements that can be treated as sentences, so titles are straightforward candidates for parsing and tagging. Exceptions include parenthetical phrases and other extraneous information sometimes appended to the title. The text of abstracts, however, must be broken into sentences. All parenthesized text is removed because it is very difficult to handle the variety of text placed within parentheses in any algorithmic fashion; this does not seem to constitute a significant loss. The text is split into elements (``sentences'') at any period preceded by two or more characters and followed by a space. This is an imperfect, simplifying assumption, undermined by abbreviations and the like, but it works well in practice. The text is also split on semicolons. Each ``sentence'' is delimited by a newline character and stored in a file. The parser and tagger analyze one sentence at a time and produce an output file of POS-tagged text.
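The splitting rules above can be sketched in a few lines. This is a hedged reconstruction of the described heuristics (the function name is hypothetical), not the project's actual pre-processor.

```python
import re

def split_into_sentences(text):
    """Sketch of the abstract pre-processing described above:
    remove parenthesized text, split at any period preceded by two
    or more characters and followed by a space, and split on
    semicolons. Whitespace is normalized in each resulting piece."""
    # Remove all parenthesized text.
    text = re.sub(r"\([^)]*\)", "", text)
    # Split where a period follows at least two non-space characters
    # and precedes whitespace (the period and space are consumed).
    parts = re.split(r"(?<=\S\S)\.\s+", text)
    # Split each piece on semicolons and normalize whitespace.
    sentences = []
    for part in parts:
        sentences.extend(" ".join(p.split()) for p in part.split(";"))
    return [s for s in sentences if s]

abstract = "Water transport was measured. The model (a simple one) fit well; errors were small."
print(split_into_sentences(abstract))
# ['Water transport was measured', 'The model fit well', 'errors were small.']
```

Each resulting ``sentence'' would then be written on its own line for the parser or tagger to consume.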

Parsing and POS-tagging

We use two publicly available applications, a parser and a tagger: the Apple Pie Parser and the Brill Tagger. We are interested in identifying and extracting noun phrases that will serve as good index terms. Part of this research involves evaluating how well the tagger and the parser accomplish this task.

The Brill Tagger and the Apple Pie Parser were both trained on data from the Penn Tree Bank, a syntactically tagged corpus which includes the Wall Street Journal (WSJ) Penn Treebank Corpus and the Penn Treebank Brown Corpus. The major difference between the two is that the Apple Pie Parser is a stochastic parser and the Brill Tagger relies on linguistic rules. We are interested in the effects of this difference, if any, on the quality and effectiveness of our association dictionaries.

The Apple Pie Parser

The Apple Pie Parser (APP) was written by Satoshi Sekine and Ralph Grishman as part of the Proteus Project at New York University. The version of the Apple Pie Parser we use (APP Version 5.9) was released on April 4, 1997. It runs on Solaris, Linux, and WindowsNT.

The APP is a bottom-up probabilistic chart parser that looks for the parse tree with the best score using a best-first search algorithm. It generates a syntactic tree with bracketing. The APP tries to produce as accurate a parse tree as possible for reasonably well-formed sentences (e.g., sentences in newspapers or well-written documents). It is not designed to parse ill-formed, but still seemingly reasonable, sentences such as those found in typical conversation.

The Brill Tagger

The Brill Tagger is a transformation-based part of speech tagger written by Eric Brill. This tagger is widely used. It was applied to experiments investigating how WordNet can be used with contextual clues to disambiguate terms (Leacock98).

According to Brill, the tagger achieves higher rates of tagging accuracy than traditional stochastic taggers. The Brill Tagger captures linguistic information using fewer than 200 simple rules. The tagger is composed of a morphological analyzer-generator and a parser-generator.

Attractive features of the Brill Tagger include its modularity and extensibility. All components are available independently. The data can be retrained on any corpus. The program and its data are both easily extensible. Available for research and commercial use, the Brill Tagger is implemented in C, with some utility programs in Perl.

Extracting terms

The results of parsing and tagging allow us to identify and extract noun phrases. The parser and the tagger mark parts of speech differently: not only do they mark different types of noun phrases, but the way embedded phrases are marked differs as well. Since neither the parser nor the tagger was specifically designed for our information retrieval purposes, we must identify the tagging patterns that let us extract the noun phrases best suited to facilitating information access.
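One such pattern can be sketched for tagged output. The ``word/TAG'' format and the tags follow the Penn Treebank conventions used by the Brill Tagger, but the pattern itself (optional adjectives followed by nouns, keeping only runs that end in a noun) is our own simplification, not the project's actual extraction rules.

```python
NP_TAGS = {"JJ", "NN", "NNS", "NNP", "NNPS"}     # adjectives and nouns
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}         # nouns only

def extract_noun_phrases(tagged_sentence):
    """Collect maximal runs of adjectives/nouns that end in a noun
    from a sentence in word/TAG format."""
    phrases, current = [], []
    for token in tagged_sentence.split():
        word, _, tag = token.rpartition("/")
        if tag in NP_TAGS:
            current.append((word, tag))
        else:
            if current and current[-1][1] in NOUN_TAGS:
                phrases.append(" ".join(w for w, _ in current))
            current = []
    if current and current[-1][1] in NOUN_TAGS:
        phrases.append(" ".join(w for w, _ in current))
    return phrases

tagged = "Medical/JJ information/NN systems/NNS support/VBP health/NN care/NN"
print(extract_noun_phrases(tagged))
# ['Medical information systems', 'health care']
```

Parser output requires a different extraction step, since the APP emits bracketed trees rather than flat word/TAG pairs.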

Removing terms

Stop words are removed from the resulting list of noun phrases. The stop word lists used for phrase-based dictionaries differ from those for word-based dictionaries. Our assumption is that some words, for example ``information'', may not be very useful on their own for topic representation, but have better representational value as part of a phrase such as ``information retrieval'' or ``information processing''. We therefore include such words in the stop word list for word-based dictionaries but keep them when building phrase-based dictionaries.
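The differential policy amounts to maintaining two lists and filtering each dictionary's terms against its own list. The lists below are illustrative stand-ins, not the project's actual stop word lists.

```python
# Word-level stops include topically weak single words like "information";
# phrase-level stops do not, so phrases containing them survive.
WORD_STOPS = {"the", "and", "it", "information", "system"}
PHRASE_STOPS = {"the", "and", "it", "the people"}

def filter_words(words):
    """Filter terms for the word-based association dictionary."""
    return [w for w in words if w not in WORD_STOPS]

def filter_phrases(phrases):
    """Filter terms for the phrase-based association dictionary."""
    return [p for p in phrases if p not in PHRASE_STOPS]

print(filter_words(["information", "retrieval"]))       # drops "information"
print(filter_phrases(["information retrieval", "it"]))  # keeps the phrase
```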


There are many factors to consider in comparing different NLP techniques used for noun phrase identification and extraction. Levels of analysis differ. Noun phrases are classified in different ways. Our focus is on parsers and taggers. There are two basic kinds of parsers and taggers: (1) stochastic and (2) linguistic. They have different strengths and weaknesses. Stochastic parsers and taggers rely completely on statistical frequencies of various sorts. Linguistic parsers and taggers are based on linguistic rules and lexicons.


How should we evaluate the effectiveness of the parser and the tagger? In this case, how effective is a particular parser or tagger in identifying noun phrases?
  1. Practicality: What compromises between efficiency and effectiveness are necessary? How accurate must the tagging be?
  2. Theoretical aspects: On what principles is the parser or the tagger constructed? What are the theoretical assumptions? Are they compatible with the goals of information access?

How information retrieval results would be affected by these differences is a different question to be considered later.

Evaluation approaches

One experimental design to test our vocabulary mapping technique is described in (Plaunt98). This experiment was based on measures of inter-indexer consistency. We can hold a certain proportion of the records apart from a data set used to build an association dictionary. Using the terms found in the titles and abstracts of the held-apart data set to search the association dictionary will return the most highly associated controlled vocabulary terms. These should also be highly likely to have been assigned to these records by human indexers. Comparing the terms returned by the association dictionary with the actual controlled vocabulary terms assigned to the records by human indexers will allow us to measure the similarity between the terms generated by a specific dictionary and the ones already assigned. We can also compare the performance measures of dictionaries created by different NLP techniques with the same training data as another way of evaluating those techniques. Comparing the closeness of match with human-assigned terms is one obvious basis for evaluation. Such comparison does not address the possibility that terms assigned by association techniques, where they differ from human-assigned terms, might be equal or superior for retrieval purposes.
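The core measurement in this design can be sketched as a simple overlap score per held-out record. The function and the example terms here are hypothetical illustrations of the comparison, not the project's actual evaluation code.

```python
def overlap_score(predicted, assigned):
    """Fraction of the human-assigned controlled vocabulary terms
    that the association dictionary also returned for a record."""
    if not assigned:
        return 0.0
    return len(set(predicted) & set(assigned)) / len(set(assigned))

# One held-out record (hypothetical term lists):
predicted = ["Medical information systems", "Health care", "Copyright"]
assigned = ["Medical information systems", "Health care", "Data privacy"]
print(overlap_score(predicted, assigned))  # 2 of 3 assigned terms found
```

Averaging such scores over all held-out records would give one consistency measure per dictionary, allowing dictionaries built with different NLP techniques to be compared on the same training data.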

Preliminary comparison 1

For these comparisons, two association dictionaries were built from the results of a search against the INSPEC database with the query ``find journal bio#''. This search returned more than 13,000 records.

The APP identified 204,493 noun phrases, while the Brill Tagger identified only 183,085, a difference of 21,408. This could mean that the APP incorrectly identified many phrases, or that the Brill Tagger failed to identify a large number of noun phrases that it should have.

The other interesting result of this test is that the parser and the tagger identified only 96,483 noun phrases in common: 47% of the APP's noun phrases were also identified by the Brill Tagger, and 53% of the Brill Tagger's noun phrases were identified by the APP.
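These overlap figures fall out of simple set arithmetic over the two extracted phrase sets. The toy sets below are stand-ins (the real sets held 204,493 and 183,085 phrases respectively).

```python
# Hypothetical miniature phrase sets from each technique.
app_phrases = {"water molecules", "medical images", "medical technology cannot", "health care"}
brill_phrases = {"water molecules", "medical images", "medical logic modules", "health care", "solvation"}

common = app_phrases & brill_phrases
print(len(common))                     # phrases found by both: 3
print(len(common) / len(app_phrases))   # share of APP phrases also found by Brill: 0.75
print(len(common) / len(brill_phrases)) # share of Brill phrases also found by APP: 0.6
```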

Preliminary comparison 2

We used the two techniques to identify noun phrases, issued the same query against the association dictionaries created by each, and compared the results. The comparison included both the noun phrases (A-terms) and the controlled vocabulary terms (B-terms) returned. In other words, does the same query result in the same ordinary language noun phrases? Does it return the same set of human-assigned controlled vocabulary terms? And are those controlled vocabulary terms ranked in the same order for both sets?

Partial match query: information

Both dictionaries were queried with the search term: ``information''. These were partial match searches.

The table in Figure 1 shows that all the ordinary language terms returned are shared by both the Apple Pie Parser association dictionary and the Brill Tagger association dictionary. Terms in common are italicized in these tables. For example, in the APP association dictionary the term most closely associated with ``information'' was ``Information science'', the second closest was ``Medical information systems'', and the third closest was ``Information systems''. The Brill Tagger associated ``Medical information systems'' first, ``Cellular effects of radiation'' second, and ``Health care'' third. The rankings differ, however, apparently reflecting differences in frequency counts resulting from the different parsing results of each technique. Five of the controlled vocabulary terms in the top ten are shared by both techniques: ``Information science'', ``Medical information systems'', ``Information technology'', ``Health care'', and ``Cellular effects of radiation''.

Figure 1. Partial match search result for query "information"

Apple Pie Parser Association Dictionary Results
Score Noun phrase Controlled vocabulary term
78.8814 information Information science
78.1168 information Medical information systems
77.3415 information systems Medical information systems
73.1386 information Health care
70.6619 information technology Information technology
62.2819 information Copyright
61.9060 information systems Health care
56.8259 information technology Health care
55.8157 information Cellular effects of radiation
53.3040 information systems Data privacy

Brill Tagger Association Dictionary Results
Score Noun phrase Controlled vocabulary term
126.8530 information systems Medical information systems
98.7863 information Cellular effects of radiation
94.9497 information systems Health care
74.7761 information Biomechanics
73.1447 information Radiation therapy
72.1698 information Information science
71.8793 information Biothermics
71.8011 information technology Information technology
69.9736 information Medical information systems
64.7999 information Bioelectric phenomena

Exact match query: system

An exact match search returns only the controlled vocabulary terms associated with noun phrases identical in form to the query. As the table in Figure 2 demonstrates, an exact match query with the term ``system'', arguably a less specific term, against both association dictionaries showed very little difference in the results returned. The Brill Tagger identified ``Dosimetry'' as a highly associated controlled vocabulary term, while the APP did not; the APP identified ``Medical expert systems'' in its top ten, while the Brill Tagger did not. Otherwise the overlap and ranking order are strikingly similar.
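The two lookup modes can be sketched against a tiny hypothetical dictionary mapping (noun phrase, controlled term) pairs to association scores; the scores below are taken from the tables in Figures 1 and 2, but the lookup function itself is an illustrative stand-in, not the project's search code.

```python
# Hypothetical miniature association dictionary:
# (noun phrase, controlled vocabulary term) -> association score.
dictionary = {
    ("water", "Water"): 227.8642,
    ("water molecules", "Hydrogen bonds"): 71.3624,
    ("system", "Molecular biophysics"): 248.2393,
}

def search(query, exact=True):
    """Exact match requires the stored noun phrase to equal the query;
    partial (truncated) match requires it only to begin with the query.
    Results are returned sorted by descending association score."""
    hits = [
        (score, phrase, term)
        for (phrase, term), score in dictionary.items()
        if (phrase == query if exact else phrase.startswith(query))
    ]
    return sorted(hits, reverse=True)

print(search("water", exact=True))   # only the phrase "water"
print(search("water", exact=False))  # also "water molecules"
```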

Figure 2. Exact match search result for query "system"

Apple Pie Parser Association Dictionary Results
Score Noun phrase Controlled vocabulary term
248.2393 system Molecular biophysics
195.0609 system Proteins
175.4192 system Cellular effects of radiation
109.9675 system Biomembrane transport
94.6629 system Bioelectric phenomena
88.8001 system Biothermics
82.2171 system Physiological models
76.5039 system Biological effects of gamma-rays
73.6449 system Medical expert systems
66.3451 system Muscle

Brill Tagger Association Dictionary Results
Score Noun phrase Controlled vocabulary term
410.0644 system Molecular biophysics
302.3192 system Cellular effects of radiation
278.1254 system Proteins
163.3447 system Biomembrane transport
155.6857 system Bioelectric phenomena
137.2306 system Biothermics
130.0540 system Physiological models
119.8809 system Muscle
118.6029 system Dosimetry
111.0834 system Biological effects of gamma-rays

Partial match query: water

A partial match is a truncated search: the results include the controlled vocabulary terms associated with all natural language terms that begin with the same text as the query. Figure 3 shows the results of a partial match query against the association dictionaries with ``water''. Both identified the ordinary language terms ``water'' and ``water molecules'' as the most frequently occurring terms containing ``water''. The APP did not identify ``Solvation'' or ``Molecular biophysics'' as potentially useful controlled vocabulary terms, whereas the Brill Tagger did. The Brill Tagger also appears to have done better at identifying the term ``water molecules'' as a noun phrase than the APP (five instances versus three).

Figure 3. Partial match search result for query "water"

Apple Pie Parser Association Dictionary Results
Score Noun phrase Controlled vocabulary term
227.8642 water Water
139.7056 water Biomechanics
71.3624 water molecules Hydrogen bonds
66.1800 water molecules Water
62.5399 water Cardiology
61.9131 water Physiological models
61.4642 water molecules Molecular dynamics method
61.3020 water Cellular effects of radiation
53.7821 water Neurophysiology
51.8685 water Molecular dynamics method

Brill Tagger Association Dictionary Results
Score Noun phrase Controlled vocabulary term
218.6793 water Water
206.5127 water Biomechanics
118.1383 water molecules Molecular dynamics method
106.7155 water Cellular effects of radiation
104.9031 water Physiological models
98.4870 water molecules Solvation
92.9816 water Cardiology
92.9689 water molecules Water
91.9367 water molecules Molecular biophysics
82.8147 water molecules Biomechanics

Exact match query: water

Figure 4 shows the results of an exact match query with ``water''. The results were very similar for both dictionaries: the top six terms were the same, with only minor differences in ranking, and overall the two share nine of the top ten terms.

Figure 4. Exact match search result for query "water"

Apple Pie Parser Association Dictionary Results
Score Noun phrase Controlled vocabulary term
227.8642 water Water
139.7056 water Biomechanics
62.5399 water Cardiology
61.9131 water Physiological models
61.3020 water Cellular effects of radiation
53.7821 water Neurophysiology
51.8685 water Molecular dynamics method
47.3778 water Biodiffusion
44.4720 water Muscle
43.1905 water Haemodynamics

Brill Tagger Association Dictionary Results
Score Noun phrase Controlled vocabulary term
218.6793 water Water
206.5127 water Biomechanics
106.7155 water Cellular effects of radiation
104.9031 water Physiological models
92.9816 water Cardiology
82.2806 water Neurophysiology
73.7564 water Muscle
70.1808 water Medical signal processing
68.2778 water Molecular dynamics method
67.9329 water Haemodynamics

Partial match query: medical

We searched with the query ``medical'' for partial matches to see what kinds of noun phrases (A-terms) each technique identified. The results shown in Figure 5 for ordinary language noun phrases were very similar. Both techniques identified ``medical images'' as the most frequently occurring noun phrase, and both tagged ``medical technology'', ``medical expert systems'', and ``medical knowledge'' as frequent noun phrases. The Brill Tagger did not rank ``medical electronics'' among the top ten, favoring ``medical logic modules'' instead. The APP appears to have failed to identify ``medical logic modules'' as a noun phrase altogether.

Figure 5. Partial match search result for query "medical"

Apple Pie Parser Association Dictionary Results
Score Noun phrase Controlled vocabulary term
93.5097 medical images Medical image processing
78.8728 medical knowledge Decision support systems
60.1598 medical expert systems Medical expert systems
59.2303 medical technology Biomedical engineering
56.9726 medical electronics Societies
56.8708 medical technology Economics
56.5006 medical technology Reviews
55.9864 medical expert systems Algebra
53.1609 medical expert system Medical expert systems
50.7716 medical knowledge Information retrieval systems

Brill Tagger Association Dictionary Results
Score Noun phrase Controlled vocabulary term
119.9875 medical images Medical image processing
109.9154 medical logic modules Logic programming
83.3679 medical logic modules Subroutines
73.6530 medical knowledge Decision support systems
66.9142 medical logic modules Program compilers
66.0145 medical expert systems Medical expert systems
63.1926 medical imaging Medical image processing
62.6058 medical logic modules Knowledge representation
61.0086 medical technology Biomedical engineering
59.8106 medical technology Economics

Partial match query: medical technology

We searched on the noun phrase ``medical technology'' for partial matches. Figure 6 shows that the top five results were exactly the same for both techniques. The Brill Tagger did not identify ``medical technology and advocacy'' as a noun phrase, whereas the APP did. However, it is debatable whether this should be considered a ``good'' noun phrase: would someone ever search on such a phrase?

The APP results for this query include the incorrectly identified noun phrase ``medical technology cannot''. This may be a problem with our approach to extracting tagged data, or it may reflect the quality of parsing done by the APP. Further investigation is needed to determine the cause of this kind of problem.

Figure 6. Partial match search result for query "medical technology"

Apple Pie Parser Association Dictionary Results
Score Noun phrase Controlled vocabulary term
59.2303 medical technology Biomedical engineering
56.8708 medical technology Economics
56.5006 medical technology Reviews
36.2614 medical technology Health care
12.2693 medical technology profession Economics
12.2693 medical technology cannot Economics
11.4246 medical technology and advocacy History
11.3357 medical technology management problems Planning
10.3361 medical technology and advocacy Biomedical engineering
10.0648 medical technology management problems Biomedical education

Brill Tagger Association Dictionary Results
Score Noun phrase Controlled vocabulary term
61.0086 medical technology Biomedical engineering
59.8106 medical technology Economics
42.6724 medical technology Reviews
37.3780 medical technology Health care
12.7788 medical technology profession Economics
11.4749 medical technology management problems Planning
10.5132 medical technology management programmes Biomedical engineering
10.0125 medical technology management problems Biomedical education
8.7924 medical technology profession Biomedical engineering
6.6519 medical technology management problems Health care


Both natural language processing techniques for identifying noun phrases appear to perform adequately. The Brill Tagger seems to do better in some cases, and it is faster than the Apple Pie Parser.

Future work

We plan to incorporate additional natural language processing techniques into our approach. We are exploring the effectiveness of using bigrams to identify noun phrases, and we plan to improve the sentence identification algorithm used in the text pre-processing stage.


[Leacock 1998] Leacock, C. and M. Chodorow (1998). Combining local context and WordNet similarity for word sense identification. In C. Fellbaum, editor, WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

[Norgard 1998] Norgard, B. (1998). Entry Vocabulary Modules and Agents. Technical report.

[Plaunt 1998] Plaunt, C. and B. A. Norgard (1998). An association based method for automatic indexing with a controlled vocabulary. Journal of the American Society for Information Science.