Subdomain Entry Vocabulary Modules Evaluation

Variation in Subdomain Indexes

Vivien Petras

June 30, 2000

Abstract: Subdomain entry vocabulary modules represent a way to provide a more specialized retrieval vocabulary in a particular subject area. Several subdomain indexes have been derived for an analysis using the INSPEC database. The results show that subdomain indexes differ significantly from each other and from the general-purpose index they were derived from. The document pools that could be retrieved using the different subdomain entry vocabulary modules also differ greatly. If a word can be understood in more than one sense (polysemy), it is more likely to lead to different output from the indexes.

Part of this report is derived from: M.K. Buckland, A. Chen, M. Gebbie, Y. Kim & B. Norgard: Variation by Subdomain in Indexes to Knowledge Organization Systems.

Introduction

To analyze how well specialized subdomain indexes can support a researcher in finding useful retrieval terms, we examined how different these indexes really are. We generated four different indexes to the INSPEC thesaurus: a general index built from a sample of records from the entire database, and specialized subdomain indexes for Biotechnology, Information Science, and Water.

The creation of subdomain indexes is explained in a previous technical report. If a subdomain index provides more purposeful retrieval terms in its specific area than the general index does, then the subdomain index can be regarded as a helpful tool in the retrieval process. As a first step, we quantitatively examined how different the subdomain indexes were from the general index. We compared the preferred retrieval term that each subdomain index suggests for a given search term with the one that the general index would suggest. A second step would be the qualitative examination and evaluation of the suggested retrieval terms by actual searchers.

Experiment I: How different are subdomain indexes?

A random sample of 600 words was drawn from each of the four vocabulary indexes. These words were checked against WordNet 1.6, an online thesaurus that enumerates the different meanings of each word. For each index, the sample was composed of 100 words with a single meaning, 100 words with two meanings, and 100 words each with three, four, five, and six meanings. The sample from the General Index was then used as a set of queries to search against the General Index and the three subdomain indexes. In roughly half of the cases, at least one of the subdomain indexes did not contain a thesaurus term for the sample term. These sample terms were discarded. For the remaining 357 sample words, the number of different thesaurus terms (from index to index) was counted.
The difference was significant. In 68% of the cases (242 out of 357) all three subdomain indexes suggested a different thesaurus term than the General Index; in 23% (81 out of 357) two subdomain indexes yielded different terms; in 8% (30 out of 357) only one index had a different thesaurus term; and in only 4 cases (1%) was there total agreement.
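The counting step can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names and the toy thesaurus terms are hypothetical, and only the four reported counts (242, 81, 30, 4) come from the text.

```python
from collections import Counter

def count_disagreements(general_term, subdomain_terms):
    """Number of subdomain indexes whose suggested term differs from the General Index term."""
    return sum(1 for term in subdomain_terms if term != general_term)

def tally(samples):
    """samples: iterable of (general_term, [bio_term, infosci_term, water_term])."""
    return Counter(count_disagreements(g, subs) for g, subs in samples)

# Hypothetical toy data; the thesaurus terms are invented for illustration.
toy = [
    ("computers", ["biocomputing", "computers", "hydrology"]),  # two indexes differ
    ("lasers", ["lasers", "lasers", "lasers"]),                 # total agreement
]
toy_tally = tally(toy)

# Counts reported in the text: key = number of subdomain indexes that differed.
reported = {3: 242, 2: 81, 1: 30, 0: 4}
total = sum(reported.values())
shares = {k: round(100 * v / total) for k, v in reported.items()}
```

With the reported counts, `total` is 357 and the rounded shares are 68%, 23%, 8%, and 1%, matching the figures above.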

Another experiment with different random samples confirmed this result. Again, samples of 600 words were drawn from each of the entry vocabulary indexes. The suggested thesaurus terms (the first preferred one from each subdomain entry vocabulary) from the three subdomain indexes were compared. More than half of the sample terms had to be discarded (only 841 out of 1800 sample terms yielded thesaurus terms in all three subdomain indexes). In this experiment, too, the large majority of cases yielded three different thesaurus terms from the three subdomain indexes (87.51%, 736 out of 841). In 96 cases (11.41%), two subdomain indexes provided the same thesaurus term; and in just 10 cases (out of 841, 1.19%) all three subdomain indexes provided the same thesaurus term.
There are real differences between subdomain indexes.

Experiment II: Multiple meanings and index variability

In this experiment, we measured how much the polysemy of words influences the variability of the subdomain indexes. Each sample word was searched against all three subdomain indexes, which resulted in a "variability" on a scale from 1 to 3 according to whether one, two, or three different thesaurus terms were suggested by the indexes. From the sampling process we already knew the number of WordNet senses of each sample word.
We found a strong and positive correlation between the number of meanings of a sample word and the variability of the subdomain indexes. For the Biotechnology sample, the average number of meanings for sample words with a variability of 1 (all three subdomain indexes yielded the same thesaurus term) was 3.40. For a variability of 2 (two different thesaurus terms) the average number of senses was 3.78, and for a variability of 3 it was 4.26. For the Information Science sample the average numbers of meanings were 3.65 (variability: 1), 3.85 (variability: 2), and 4.51 (variability: 3); and for the Water sample they were 3.69, 3.79, and 4.35 respectively.
As WordNet senses go up, the subdomain variability also goes up.
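The grouping behind these averages can be sketched as below. This is an illustrative reconstruction under stated assumptions: the record layout, the function names, and the toy data are hypothetical, and only the nine reported averages are taken from the text.

```python
from collections import defaultdict
from statistics import mean

def variability(terms):
    """Number of distinct thesaurus terms suggested by the three subdomain indexes (1 to 3)."""
    return len(set(terms))

def mean_senses_by_variability(records):
    """records: iterable of (wordnet_sense_count, [bio_term, infosci_term, water_term])."""
    groups = defaultdict(list)
    for senses, terms in records:
        groups[variability(terms)].append(senses)
    return {v: mean(s) for v, s in sorted(groups.items())}

# Hypothetical toy records; the terms "a", "b", "c" stand in for thesaurus terms.
toy = [
    (1, ["a", "a", "a"]),
    (3, ["a", "a", "a"]),
    (2, ["a", "b", "b"]),
    (4, ["a", "b", "a"]),
    (5, ["a", "b", "c"]),
]
toy_means = mean_senses_by_variability(toy)

# Averages reported in the text, ordered by variability 1, 2, 3.
reported = {
    "Biotechnology": [3.40, 3.78, 4.26],
    "Information Science": [3.65, 3.85, 4.51],
    "Water": [3.69, 3.79, 4.35],
}
# True when the average sense count rises with variability in every sample.
all_increasing = all(a < b < c for a, b, c in reported.values())
```

For the reported figures, `all_increasing` is true: in each of the three samples the average number of senses grows as variability goes from 1 to 3.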

Experiment III: How different are the results of subdomain retrieval?

To further confirm the results from Experiment I (where we analyzed how many common thesaurus terms are suggested by different subdomain entry vocabulary modules), we next asked how large the overlap is among the documents that could actually be retrieved. We examined the document pools that could be retrieved using the suggested thesaurus terms from the three special subdomain entry vocabularies: Biotechnology, Information Science, and Water.
First, the randomly drawn sample terms (from the two sample sets from Experiment I) were submitted to the entry vocabulary modules (EVMs) to obtain the top preferred thesaurus term. These terms were then submitted to the INSPEC database to retrieve the actual documents (those containing the suggested thesaurus terms). By applying a Boolean query strategy we could find the documents that had more than one of the suggested thesaurus terms in common. For each sample term and its three suggested thesaurus terms (one Biotechnology thesaurus term, one Information science thesaurus term, and one Water thesaurus term) we submitted the following 7 queries to INSPEC:

  1. number of documents found with the Information science term
  2. number of documents found with the Biotechnology term
  3. number of documents found with the Water term
  4. number of documents found with the Information science AND Biotechnology term (intersection)
  5. number of documents found with the Biotechnology AND Water term (intersection)
  6. number of documents found with the Information science AND Water term (intersection)
  7. number of documents found with the Information science AND Biotechnology AND Water term (intersection).
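The seven queries listed above correspond to every non-empty combination of the three suggested terms. A minimal sketch, assuming the documents retrieved for each term are available as sets of document IDs (the function name and the toy IDs are hypothetical, and the study itself ran Boolean queries against INSPEC rather than intersecting sets locally):

```python
from itertools import combinations

def overlap_counts(doc_sets):
    """doc_sets: dict mapping subdomain name -> set of document IDs retrieved by its term.

    Returns a dict mapping each non-empty combination of subdomains
    to the number of documents shared by all terms in that combination.
    """
    names = list(doc_sets)
    counts = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            shared = set.intersection(*(doc_sets[n] for n in combo))
            counts[combo] = len(shared)
    return counts

# Hypothetical document IDs for one sample term.
toy_sets = {
    "Information science": {1, 2, 3, 4},
    "Biotechnology": {3, 4, 5},
    "Water": {4, 6},
}
toy_counts = overlap_counts(toy_sets)
```

Three subdomains give 3 single-term counts, 3 pairwise intersections, and 1 three-way intersection, which is where the total of 7 queries comes from. The pairs are produced in a different order than in the list above.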

In order to examine the impact of loose and rigid query strategies we applied two query strategies:

i) rigid query strategy (restrict the number of documents found) by requiring the occurrence of the sample term together with the suggested thesaurus term in the same document
e.g. sample term = galileo, suggested thesaurus term by the Information science EVM = reservation computer systems
query # 1 = FI KW galileo AND XSU reservation computer systems
ii) loose query strategy by requiring only the occurrence of the suggested thesaurus term in the controlled or free subject headings of the document.
e.g. sample term = galileo, suggested thesaurus term by the Information science EVM = reservation computer systems
query # 1 = FI SU reservation computer systems
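The difference between the two strategies for query # 1 can be made concrete with two small helper functions. These helpers are hypothetical and simply reproduce the two example strings shown above; the text does not show how the intersection queries (4 to 7) were phrased, so they are not attempted here.

```python
def rigid_query(sample_term, thesaurus_term):
    """Rigid strategy: the sample term and the suggested thesaurus term must co-occur."""
    return f"FI KW {sample_term} AND XSU {thesaurus_term}"

def loose_query(thesaurus_term):
    """Loose strategy: only the suggested thesaurus term is required."""
    return f"FI SU {thesaurus_term}"

# The example from the text: sample term and Information science EVM suggestion.
rigid = rigid_query("galileo", "reservation computer systems")
loose = loose_query("reservation computer systems")
```

The rigid form adds the sample term as an extra condition, so it can be expected to retrieve a smaller document pool than the loose form for the same thesaurus term.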

The results were striking. The overlap between documents resulting from queries with different subdomain index terms is very small: for the rigid query strategy, 4.16% of the documents retrieved contained all three suggested thesaurus terms (from the three EVMs) and the sample term. Interestingly, for the loose query strategy the number was even smaller (1.07%). Only a few sample terms (1-6) per sample file account for the greatest part of this overlap (e.g. sample terms that lead to the same top index term in all three EVMs).
In general, queries requiring the Information science AND Biotechnology index terms have more documents in common (22.17% for the rigid, 5.73% for the loose query strategy) than queries requiring the Biotechnology AND Water index terms (18.90% for rigid, 4.53% for loose), which in turn have more documents in common than those requiring the Information science AND Water index terms (10.63% for rigid, 3.28% for loose).

These experiments show that special subdomain entry vocabularies not only suggest different thesaurus terms for a given search term but also lead to different search results (i.e. documents) during actual retrieval. Different subdomain entry vocabulary modules can support a researcher in finding more specific query terms and should ultimately yield better and more precise retrieval results.


Last modified: June 30, 2000