School of Information Management & Systems, U.C. Berkeley
Search Support for Unfamiliar Metadata Project

An Analysis of the Effects on Searching of the Use of Three Subdomain Entry Vocabularies: Technical Note

Michael Buckland

June 30, 1999, revised March 12, 2000.

Introduction

The scope ("domain") covered by large bibliographic or textual databases usually includes several specialized topical areas ("subdomains"). Each topical area reflects the work of a community of specialists. These communities evolve their own specialized vocabulary: different terms and specialized meanings of other terms. The obvious approach is to create a single dictionary for the target database as a whole. But searches are usually concerned with a specialized topic within a database, which suggests that search support should be customized to each subdomain. If search support were provided for subdomains, would that lead to better (more precise) search terms? Would subdomain search support lead to different, better retrieval results?

Preliminary analyses presented below indicate substantial differences in the choice of metadata terms and in the retrieval results.

Three Subdomain Entry Vocabularies

Specialized dictionaries (entry vocabulary modules, or EVMs) were created for three specialized topics (subdomains) within the INSPEC abstracting service. In each case, the training set used to create the dictionary was deliberately limited to literature on that topic instead of being representative of the database as a whole. The three were:

- Biology - created October 27, 1997 using 13,386 "bio-related" records from the INSPEC 1993-97 dataset;

- Information Studies;

- Water - created April 9, 1996 using 9,613 water-related records from the University of California INSPEC 1990-96 dataset.

These three subdomain dictionaries can be found from the prototypes page.
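This note does not describe how the association weights in each dictionary were computed. As a minimal sketch only, assuming plain co-occurrence counts between words in record text and the thesaural terms assigned to those records, a subdomain dictionary could be built as follows. The function names and the toy training records are hypothetical and are not INSPEC data.

```python
from collections import Counter, defaultdict


def build_evm(records):
    """Map each word in the record text to a Counter of the thesaural
    terms assigned to the records that contain that word."""
    evm = defaultdict(Counter)
    for text, thesaural_terms in records:
        for word in set(text.lower().split()):
            evm[word].update(thesaural_terms)
    return evm


def suggest(evm, query, limit=10):
    """Rank thesaural terms by co-occurrence count with the query word
    (ties broken alphabetically)."""
    counts = evm.get(query.lower(), Counter())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:limit]]


# Hypothetical toy training sets, one per subdomain (illustration only).
bio_records = [
    ("water transport across cell membranes", ["Biomembrane transport", "Water"]),
    ("water content of muscle tissue", ["Muscle", "Water"]),
]
water_records = [
    ("water supply planning for cities", ["Water supply"]),
    ("cooling water in reactor safety", ["Fission reactor safety", "Water supply"]),
]

bio_evm = build_evm(bio_records)
water_evm = build_evm(water_records)

# The same query word yields different ranked suggestions per subdomain.
print(suggest(bio_evm, "water"))
print(suggest(water_evm, "water"))
```

Because each dictionary is trained only on its own subdomain's records, the same query word is ranked against a different set of associations in each one.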


Four Searches

The same four search queries were submitted to each of the three subdomain entry vocabulary modules:

- "DNA"
- "Gene"
- "Information dissemination" and
- "Water"


Search Results - 1: Choice of Thesaural Terms

Each query generated a ranked list of associated INSPEC thesaural terms, as follows. For the first-ranked term in each list, a search in the form FIND THESAURAL TERM [First ranked term] was conducted on the University of California copy of the INSPEC database on July 5, 1999, and the number of records retrieved is noted in parentheses below:

Search for "DNA"

- Using the BIO subdomain EVM: DNA (retrieved 1,970 records); Biomechanics; Radiation therapy; Biomembrane transport; Neurophysiology; Physiological models; Electrophoresis; Cellular effects of radiation; Proteins; Biomolecular effects of radiation.

- Using the INFORMATION STUDIES subdomain EVM: Molecular biophysics (7,762 records); Biotechnology; Biology computing; CDRoms; Information services.

- Using the WATER subdomain EVM: Molecular biophysics (7,762); Biothermics; Bioelectric phenomena; Chemical shift; Liver; Proteins; Electron beam effects; Ions; Cellular biophysics; Water.

It will be noted that the thesaural term "Molecular biophysics" ranks first in both the Information Studies and the Water lists, and that "Proteins" occurs in both the Bio and the Water lists, but there is no other duplication in these suggested thesaural terms.

Search for "Gene"

- Using the BIO subdomain EVM: Genetics (1,377 records); Cellular effects of radiation; Biomechanics; Cellular biophysics; Physiological models; Biological effects of x rays; DNA; Biological effects of ionising radiation; Temperature; Eigenvalues and eigenfunctions.

- Using the INFORMATION STUDIES subdomain EVM: Biology computing (2,539 records); DNA; Cellular biophysics; Molecular biophysics; Full text databases; Bibliographic systems; Information services.

- Using the WATER subdomain EVM: Biological effects of ionising radiation (1,058 records); Lakes; Radioactive pollution; Water pollution; Accidents.

It will be noted that the thesaural terms "DNA" and "Cellular biophysics" occur in both the Bio and the Information Studies lists, and that "Biological effects of ionising radiation" occurs in both the Bio and the Water lists, but there is no other duplication in these suggested thesaural terms.

Search for "Information Dissemination"

- Using the BIO subdomain EVM: Health care (2,632 records); Medical information systems; Security of data; Medical administrative data processing; Molecular biophysics; Data privacy; Biomechanics; Proteins; Bioelectric phenomena.

- Using the INFORMATION STUDIES subdomain EVM: Information science (256 records); Information services; Information needs; Information dissemination; Management information systems; Information retrieval systems; Information centres; Information use; Education; Medical administrative data processing; Economic and sociologic effects.

- Using the WATER subdomain EVM: Water supply (590 records); Management information systems; Information systems; Information needs; Information services; Finance; Water treatment; Geographic information systems.

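It will be noted that the thesaural terms "Information services", "Information needs", and "Management information systems" occur in both the Information Studies and the Water lists, and that "Medical administrative data processing" occurs in both the Bio and the Information Studies lists, but there is no other duplication in these suggested thesaural terms.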

Search for "Water"

- Using the BIO subdomain EVM: Water (9,319 records); Biomechanics; Physiological models; Neurophysiology; Cellular effects of radiation; Cardiology; Muscle; Blood; Bone; and Biomedical ultrasonics.

- Using the INFORMATION STUDIES subdomain EVM: Agriculture (1,460 records); Natural resources; Forecasting theory; Operations research; Erosion; Geomorphology; Rain; Soil; Public utilities; and Town and country planning.

- Using the WATER subdomain EVM: Fission reactor safety (2,914 records); Fission reactor fuel; Polymers; Organic insulating materials; Water supply; Cable insulation; Insulation testing; Insulating oils; Liquid structure; and Fission reactor operation.

It will be noted that there is no duplication in these suggested thesaural terms.
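The comparison of the suggested term lists can also be checked mechanically. The sketch below transcribes the ranked lists given above and reports, for each query, the thesaural terms shared by each pair of subdomain EVMs. The names SUGGESTIONS and shared_terms are illustrative only.

```python
from itertools import combinations

# Ranked thesaural terms as listed above, keyed by query, then by subdomain EVM.
SUGGESTIONS = {
    "DNA": {
        "Bio": ["DNA", "Biomechanics", "Radiation therapy", "Biomembrane transport",
                "Neurophysiology", "Physiological models", "Electrophoresis",
                "Cellular effects of radiation", "Proteins",
                "Biomolecular effects of radiation"],
        "Info. S.": ["Molecular biophysics", "Biotechnology", "Biology computing",
                     "CDRoms", "Information services"],
        "Water": ["Molecular biophysics", "Biothermics", "Bioelectric phenomena",
                  "Chemical shift", "Liver", "Proteins", "Electron beam effects",
                  "Ions", "Cellular biophysics", "Water"],
    },
    "Gene": {
        "Bio": ["Genetics", "Cellular effects of radiation", "Biomechanics",
                "Cellular biophysics", "Physiological models",
                "Biological effects of x rays", "DNA",
                "Biological effects of ionising radiation", "Temperature",
                "Eigenvalues and eigenfunctions"],
        "Info. S.": ["Biology computing", "DNA", "Cellular biophysics",
                     "Molecular biophysics", "Full text databases",
                     "Bibliographic systems", "Information services"],
        "Water": ["Biological effects of ionising radiation", "Lakes",
                  "Radioactive pollution", "Water pollution", "Accidents"],
    },
    "Information dissemination": {
        "Bio": ["Health care", "Medical information systems", "Security of data",
                "Medical administrative data processing", "Molecular biophysics",
                "Data privacy", "Biomechanics", "Proteins", "Bioelectric phenomena"],
        "Info. S.": ["Information science", "Information services",
                     "Information needs", "Information dissemination",
                     "Management information systems",
                     "Information retrieval systems", "Information centres",
                     "Information use", "Education",
                     "Medical administrative data processing",
                     "Economic and sociologic effects"],
        "Water": ["Water supply", "Management information systems",
                  "Information systems", "Information needs",
                  "Information services", "Finance", "Water treatment",
                  "Geographic information systems"],
    },
    "Water": {
        "Bio": ["Water", "Biomechanics", "Physiological models", "Neurophysiology",
                "Cellular effects of radiation", "Cardiology", "Muscle", "Blood",
                "Bone", "Biomedical ultrasonics"],
        "Info. S.": ["Agriculture", "Natural resources", "Forecasting theory",
                     "Operations research", "Erosion", "Geomorphology", "Rain",
                     "Soil", "Public utilities", "Town and country planning"],
        "Water": ["Fission reactor safety", "Fission reactor fuel", "Polymers",
                  "Organic insulating materials", "Water supply",
                  "Cable insulation", "Insulation testing", "Insulating oils",
                  "Liquid structure", "Fission reactor operation"],
    },
}


def shared_terms(lists_by_evm):
    """Return {(evm_a, evm_b): sorted shared terms} for every pair of EVMs."""
    shared = {}
    for a, b in combinations(lists_by_evm, 2):
        shared[(a, b)] = sorted(set(lists_by_evm[a]) & set(lists_by_evm[b]))
    return shared


for query, lists_by_evm in SUGGESTIONS.items():
    for pair, terms in shared_terms(lists_by_evm).items():
        print(query, pair, terms)
```

The comparison is by exact term match, so related but distinct terms such as "Water" and "Water supply" are not counted as duplicates.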

Search Results - 2: Retrieval Results

As a preliminary test of the effects on retrieval results, the first ranking thesaural term was used as a query submitted to the University of California's copy of the INSPEC database on June 30, 1999.
The number of items retrieved using the first-ranked term from the Bio, Information Studies, and Water EVMs for each search is shown on successive rows. Subsequent rows show the number of items that were retrieved in two or more of these searches, found by submitting searches for two or three thesaural terms using a Boolean AND in the form: FIND THESAURAL TERM [Term A] AND THESAURAL TERM [Term B] AND [when applicable] THESAURAL TERM [Term C]. For example, the bottom row shows the retrieval results of Boolean AND searches using the first-ranked thesaural terms suggested by each of the three subdomain entry vocabulary dictionaries. In searches relating to "DNA", 959 records are common to all three searches, but in the other three search areas no items were retrieved: there was no duplication in coverage.


Figure 1. Overlap in retrieval results.

Results of individual searches
Subdomain                DNA      Gene     Info. Diss.   Water
Bio                      1,970    1,377    2,632         9,319
Info. S.                 7,762    2,539    256           1,460
Water                    7,762    1,058    590           2,914

Duplication among pairs of retrieval results
Bio & Info. S.           959      0        1140
Bio & Water              959      153134
Info. S. & Water         7,762    0        0             0

Duplication among all three retrieved sets
Bio & Info. S. & Water   959      0        0             0
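The Boolean AND searches behind the overlap rows can be sketched as below. This is an illustration only: it assumes the bracketed placeholders in the query form quoted above are replaced directly by the term text, and the function names are hypothetical. Repeated terms are collapsed, since two EVMs may suggest the same first-ranked term (as the Information Studies and Water EVMs do for "DNA").

```python
def boolean_and_query(terms):
    """Build a query in the form
    FIND THESAURAL TERM A AND THESAURAL TERM B [AND THESAURAL TERM C]."""
    unique_terms = list(dict.fromkeys(terms))  # drop repeats, keep order
    if not unique_terms:
        raise ValueError("at least one thesaural term is required")
    return "FIND " + " AND ".join(f"THESAURAL TERM {t}" for t in unique_terms)


def overlap_share(overlap_count, set_size):
    """Fraction of one retrieved set that also falls in the overlap."""
    return overlap_count / set_size


# First-ranked terms for the "DNA" query from the Bio, Info. S., and Water EVMs.
dna_query = boolean_and_query(["DNA", "Molecular biophysics", "Molecular biophysics"])
print(dna_query)

# Share of the 1,970 records retrieved by "DNA" that are among the 959 common records.
print(round(overlap_share(959, 1970), 3))
```

Because the Information Studies and Water EVMs suggest the same first-ranked term for "DNA", the three-way search reduces to the same two-term query as the Bio pairings, which is consistent with the figure 959 appearing in all of those rows.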


Conclusions

These are preliminary, exploratory results and there are numerous methodological issues to be addressed.

Nevertheless, these initial results indicate that topical (subdomain) association dictionaries do reflect significant differences in language use. They generate different lists of thesaural terms and, in three cases out of four, lead to different literatures within the domain of INSPEC. These subdomains appear to be concerned with quite different discourses, and the differences might well have remained hidden without a subdomain approach. By selecting different data sets that represent distinct domains of discourse (subdomains), this method of linking natural language terms with the thesaural terms used within a particular domain retrieves quite different sets of metadata vocabulary terms for each domain (and thereby quite different documents), even when the query is the same. It shows that different subject domains have different patterns of association between the ordinary language terms that appear in titles and abstracts and the metadata vocabulary terms assigned to the records in which they occur. That is, depending on the subject subdomain in INSPEC, the same ordinary language term, such as "water", is used in different senses in different contexts and is therefore associated with different metadata vocabulary terms. If a searcher were able to specify a subdomain of interest and use the corresponding specialized Entry Vocabulary Module, it would then be possible to submit an array of queries in that area of interest to the dictionary with much more satisfactory results. In effect, subdomain vocabulary mappings add topical focus.