Sensitivity of Entry Vocabulary Modules to Subdomains

Youngin Kim

June 28, 1998, Revised July 14, 1998, December 12, 1998

Abstract: In this report we discuss the problem of determining a useful level of subject specificity for vocabulary mapping in the building of Entry Vocabulary Modules (EVMs). We tested the sensitivity of our association dictionary building method to subdomains within existing databases by selecting the training data sets at the subdomain level. We found that the notion of subdomain is useful, and we plan to implement more principled methods of defining subdomains and identifying training data for these subdomains in building EVMs.

Acknowledgement: The work reported here was supported by Defense Advanced Research Projects Agency through DARPA Contract N66001-97-C-8541; AO# F477: Search Support for Unfamiliar Metadata Vocabularies.

Introduction

The sensitivity of Entry Vocabulary Modules to variations between subdomains within repositories is of concern to us in this project. We discuss some preliminary test results and our approach to integrating the concept of subdomain in the design of EVMs (Task C, Year one).

Metadata vocabularies and domains

Searching is likely to be effective and efficient only when the searcher is familiar with the classification and indexing schemes (metadata vocabularies) used to search databases. The rapid increase in network-accessible databases and the widespread adoption of metadata vocabularies means that it will increasingly be the case that searches will be issued against unfamiliar metadata vocabularies.

Searchers need to select information from a large population of heterogeneous repositories with quite diverse metadata vocabularies (i.e., approaches to organizing information that employ different categorization, classification, and indexing semantics). It therefore seems beneficial to provide mappings between ordinary language and these metadata vocabularies.

Our Entry Vocabulary Modules are designed to accept statements in the searcher's terms and respond with a ranked list of terms from the system vocabulary to help the searcher to deal with unfamiliar metadata.

In this project we develop association dictionaries that map ordinary language terms to the metadata vocabularies of highly-used databases, such as BIOSIS and INSPEC. We base these association dictionaries on training data from subsets of existing documents in these databases.

Details on the methods and techniques used to build Entry Vocabulary Modules are discussed elsewhere (Plaunt 1998, Norgard 1998).
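The general idea of an association dictionary can be illustrated with a minimal sketch. It is not the method of the reports cited above: it assumes plain co-occurrence counts between title words and assigned metadata terms, and the records, field names (`title`, `terms`), and function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training records: a free-text title plus the metadata
# terms assigned to the record by an indexer.
RECORDS = [
    {"title": "Water pollution in coastal lakes",
     "terms": ["water pollution", "oceanography"]},
    {"title": "Modelling pollution transport in rivers",
     "terms": ["water pollution", "fluid mechanics"]},
    {"title": "Laser measurement of water flow",
     "terms": ["lasers", "fluid mechanics"]},
]


def build_association_dictionary(records):
    """Count how often each title word co-occurs with each metadata term."""
    assoc = defaultdict(Counter)
    for record in records:
        # Use a set so a word repeated in one title is counted once per record.
        for word in set(record["title"].lower().split()):
            for term in record["terms"]:
                assoc[word][term] += 1
    return assoc


def suggest_terms(assoc, query, limit=10):
    """Rank metadata terms by summed co-occurrence with the query words."""
    scores = Counter()
    for word in query.lower().split():
        scores.update(assoc.get(word, {}))
    return [term for term, _ in scores.most_common(limit)]


dictionary = build_association_dictionary(RECORDS)
print(suggest_terms(dictionary, "pollution"))
```

Raw counts favor metadata terms that are frequent everywhere, so a real dictionary would presumably weight associations statistically; the counts here only show the shape of the mapping from a searcher's word to a ranked list of system terms.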

In developing actual applications with this potential benefit, the next issue in design that confronts us is: What level of domain definition is optimal for mapping ordinary language terms to metadata vocabulary terms?

Ordinarily, metadata vocabularies are studied as a whole, and our vocabulary mapping can be designed for a whole database with its own metadata vocabulary. In practice, however, users are rarely equally interested in all of the contents of a database. They are usually interested in some specific subdomain reflecting their particular interest. Hence we have concentrated on Entry Vocabulary Modules that reflect topical, task-oriented subdomains. With these assumptions in mind, we examine a potentially more useful level of mapping with respect to subdomains (i.e., subject domains narrower than that covered by an entire database).

In the following sections, we show the mapping results of our association methods for two different metadata vocabularies. We also discuss a preliminary exploration of the sensitivity of association patterns to the subdomains within a single database.

Sublanguage differences between large metadata vocabularies

In Figure 1, we see examples of metadata vocabulary terms proposed as most likely to be relevant by the association dictionaries based on similarly defined subdomains within two databases. These two databases cover different subject domains: BIOSIS (biology) and INSPEC (physics, electronics and computing). Both dictionaries were based on the same subdomain, defined by "water" occurring in the titles of bibliographic records in the data sets.

We used the same term, "pollution", for both searches. Each association dictionary returns a list of the metadata vocabulary terms most highly associated with the submitted query; the two result lists are shown in Figure 1. We refer to the idiosyncratic language used in a subdomain as a sublanguage (Grishman 1986).

Figure 1. Sublanguage differences
Rank	BIOSIS: Top 10 Metadata Terms associated with "pollution"	INSPEC: Top 10 Metadata Terms associated with "pollution"
1	public health environmental health air water and soil pollution	water pollution
2	food and industrial microbiology biodegradation and biodeterioration	geophysics computing
3	toxicology environmental and industrial toxicology	dynamic programming
4	ecology environmental biology oceanography and limnology	query languages
5	general biology institutions administration and legislation	lasers
6	ecology environmental biology oceanography	software packages
7	animal production general methods	process control
8	plant physiology biochemistry and biophysics water relations	microcomputers
9	physiology and biochemistry of bacteria	fluid mechanics
10	public health disease vectors inanimate	information retrieval

These two result sets demonstrate obvious heterogeneity among the metadata vocabularies and suggest the consequent advantages of providing vocabulary mapping devices. By presenting these lists of metadata vocabulary terms in place of conventional keyword search results, which might be overwhelmingly large and contain a significant proportion of non-relevant items, we give users the opportunity to do more effective subject searching. Furthermore, by providing a facility for navigating the structure of metadata systems (in our case, the INSPEC thesaurus), we give users the chance to understand the organization of the metadata vocabulary.

To build on this advantage, we wish to explore the degree of domain definition that is most useful in actual information searching situations. As mentioned earlier, we think that a mapping facility for more restricted subject domains might be more useful, because BIOSIS and INSPEC as whole databases are already quite broad in their subject coverage.

Differences among subdomains within a repository

Our assumption is that subdomains within a database like BIOSIS or INSPEC exhibit differences in language use (i.e., sublanguages).

Creating subject subdomains

As a preliminary exploration, we have defined subject subdomains within the existing databases. For example, from the same INSPEC database, we selected two sets of data by doing title keyword searches with the terms "water" and "bio#" (where # is a truncation indicator). In this manner, we collected two sets of documents that each approximately represent a subject subdomain within the broader subject areas covered by INSPEC: water management studies as one, and bio-engineering and biophysics as the other.
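The title keyword selection described above can be sketched as follows. This is an illustration only: it assumes that a trailing "#" matches any word ending and that matching is case-insensitive on whole words, which may differ from the behavior of the actual INSPEC search interface. The sample titles and function names are hypothetical.

```python
import re


def truncation_pattern(query):
    """Compile a whole-word pattern; a trailing '#' matches any word ending."""
    if query.endswith("#"):
        body = re.escape(query[:-1]) + r"\w*"
    else:
        body = re.escape(query)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


def select_subdomain(records, query):
    """Keep the records whose title matches the (possibly truncated) query."""
    pattern = truncation_pattern(query)
    return [record for record in records if pattern.search(record["title"])]


# Hypothetical titles, used only to exercise the matching rules.
RECORDS = [
    {"title": "Heavy water moderated reactor safety"},
    {"title": "Biomechanics of the human knee"},
    {"title": "Bio-optical sensing of water quality"},
    {"title": "Symbiosis in optical networks"},
]

water_set = select_subdomain(RECORDS, "water")
bio_set = select_subdomain(RECORDS, "bio#")
print(len(water_set), len(bio_set))
```

Under these assumptions "bio#" matches words that begin with "bio" but not words that merely contain it, and a record can fall into both sets when its title matches both queries.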

Inspection of the journal titles collected within each group indicated that this assumption was largely valid. The following are the lists of the twenty most frequently occurring journal titles in each group of data selected by the searches on "water" and "bio#", respectively. Figure 2 shows the journal titles for the "water" subdomain and Figure 3 shows the journal titles for the "bio" subdomain. Only one journal (Biophysical Journal) covers both topics; otherwise, they appear to be quite distinct topic areas or subdomains.

Figure 2. Journal titles returned when "water" is the query (INSPEC "water" subdomain)
1	Journal of Chemical Physics
2	Journal of Geophysical Research
3	Chemical Physics Letters
4	Journal of Physical Chemistry
5	Nuclear Technology
6	Transactions of the American Nuclear Society
7	Nuclear Engineering and Design
8	Journal of the Acoustical Society of America
9	Proceedings of the 1994 International Topical Meeting on Light Water Reactor
10	Proceedings of the U.S. Nuclear Regulatory Commission

Figure 3. Journal titles with "bio#" as a query in the INSPEC "bio" subdomain.
1	Biophysical Journal
2	Medical & Biological Engineering & Computing
3	Physics in Medicine and Biology
4	Journal of Biomechanics
5	IEEE Transactions on Biomedical Engineering
6	International Journal of Radiation Oncology Biology Physics
7	International Journal of Radiation Biology
8	Biofizika
9	Biological Cybernetics
10	Computer Methods and Programs in Biomedicine
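The journal-title inspection behind Figures 2 and 3 amounts to a frequency tally per subdomain followed by a check for shared titles. A minimal sketch, assuming each record carries a hypothetical `journal` field; the record counts below are invented for illustration and are not the counts from our data sets.

```python
from collections import Counter


def top_journals(records, limit=20):
    """Return the most frequently occurring journal titles, most frequent first."""
    counts = Counter(record["journal"] for record in records)
    return [journal for journal, _ in counts.most_common(limit)]


def shared_journals(first, second):
    """Return the journal titles that appear in both ranked lists."""
    return sorted(set(first) & set(second))


# Illustrative records only; the frequencies are made up.
water_records = [
    {"journal": "Journal of Chemical Physics"},
    {"journal": "Journal of Chemical Physics"},
    {"journal": "Nuclear Technology"},
    {"journal": "Biophysical Journal"},
]
bio_records = [
    {"journal": "Biophysical Journal"},
    {"journal": "Biophysical Journal"},
    {"journal": "Journal of Biomechanics"},
]

water_top = top_journals(water_records)
bio_top = top_journals(bio_records)
print(shared_journals(water_top, bio_top))
```

A small overlap between the two lists is taken as a rough sign that the title searches picked out distinct subdomains.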

Testing sensitivity to subdomains

We created two subdomain association dictionaries with these two sets of data from INSPEC. As a test to examine the subdomain sensitivity of our vocabulary mapping method, we submitted the same query, "water", to retrieve the most likely metadata vocabulary terms. The results of this search confirmed our expectation that these two groups of data would be distinguishably different in their use of language.

When we submit the same query "water" to these two dictionaries, we see very different result sets. Figure 4 shows the two sets of top ten metadata terms (one from the INSPEC "water" dictionary and the other from the INSPEC "bio" dictionary) associated with the natural language term "water".

Figure 4. Top 10 INSPEC thesaurus terms returned from the INSPEC-based "water" dictionary compared to the top 10 thesaurus terms returned from the INSPEC-based "bio" dictionary for the query "water".
Top 10 INSPEC Thesaurus Terms returned for the query "water"
Rank	From the INSPEC-based "water" subdomain dictionary	From the INSPEC-based "bio" subdomain dictionary
1	fission reactor fuel	water
2	water supply	biomechanics
3	water	physiological models
4	water treatment	neurophysiology
5	liquid structure	cellular effects of radiation
6	organic insulating materials	cardiology
7	accidents	muscle
8	fission reactor safety	blood
9	polymers	bone
10	fission reactor materials	biomedical ultrasonics

Discussion of the findings

We can see from the result sets that, except for the single common thesaurus term of "water", the two association dictionaries suggest completely different terms. By selecting two different data sets that represent distinct domains of discourse (or subdomains), this method of linking the language used within a particular domain with thesaurus terms results in retrieving quite different sets of metadata vocabulary terms for each domain even when the query is the same.
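The degree of overlap between the two lists in Figure 4 can be checked directly. The sketch below copies the terms from Figure 4; the use of the Jaccard coefficient as a summary measure is our choice for illustration and was not part of the original test.

```python
# Top 10 INSPEC thesaurus terms for the query "water", copied from Figure 4.
WATER_DICT = [
    "fission reactor fuel", "water supply", "water", "water treatment",
    "liquid structure", "organic insulating materials", "accidents",
    "fission reactor safety", "polymers", "fission reactor materials",
]
BIO_DICT = [
    "water", "biomechanics", "physiological models", "neurophysiology",
    "cellular effects of radiation", "cardiology", "muscle", "blood",
    "bone", "biomedical ultrasonics",
]

shared = set(WATER_DICT) & set(BIO_DICT)
union = set(WATER_DICT) | set(BIO_DICT)

# Jaccard coefficient: shared terms divided by all distinct terms.
jaccard = len(shared) / len(union)

print(sorted(shared))
print(f"Jaccard overlap: {jaccard:.3f}")
for term in sorted(shared):
    # Ranks are 1-based positions in each list.
    print(term, WATER_DICT.index(term) + 1, BIO_DICT.index(term) + 1)
```

One shared term out of 19 distinct terms gives an overlap of about 0.053, and the shared term "water" ranks third in the "water" dictionary but first in the "bio" dictionary.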

The vocabulary differences between heterogeneous metadata systems are obvious and clearly expected, but this kind of difference between subdomains within the same database, which share the same metadata system, is an interesting and potentially useful finding.

It shows that different subject domains have different patterns of associations between the ordinary language terms that appear in both the titles and abstracts and the metadata vocabulary terms assigned to the records in which they occur. That is, depending on the subject subdomain in INSPEC, the same ordinary language term "water" could be said to have been used with different senses in different contexts and was therefore associated with different metadata vocabulary terms.

These association patterns are relatively reliable because they involve subject indexing by human indexers, where subject expertise is presumed and an understanding of the document topics is expected. This method can be said to utilize the human judgement embedded in the association patterns that are captured.

These preliminary results show the sensitivity of our mapping method to subdomains defined within existing databases and the consequent usefulness of providing subdomain levels of vocabulary mapping.

Implications for information retrieval

These findings have significance in the context of information retrieval, the fundamental background of this research. When someone unfamiliar with the metadata scheme of a specific database naively submits a subject search with the query term "water" to a large database like INSPEC, he or she will retrieve a very large result set which includes all the subject areas that index "water". This result set will also cover a great number of the many uses of the term "water". The coverage of the resulting set of records will in all likelihood be very general, not to mention overwhelming.

However, if such a user were able to specify a subdomain of interest and have the EVM component of our system create an association dictionary based on this specification, it would then be possible to submit an array of queries in this area of interest to the dictionary with much more satisfactory results. In effect, subdomain EVMs add topic focus.

Metadata vocabularies already provide a mapping function by representing more than one indexing term for each record in the database. This is based on human knowledge and judgment. One problem for the user is that this mapping is usually rather hidden so that the user must assume the burden of guessing which ordinary language terms could possibly be mapped to which metadata vocabulary terms.

This approach attempts to ease this burden by tracing the process of human indexing backward. We try to find the patterns in how ordinary language terms are indexed with metadata vocabulary terms in a given domain by statistically examining the pre-indexed items in that domain. We then provide a mapping based on that relationship. As we have shown, if mapping between the two languages is useful at all, mapping at the subdomain level should be still more useful.
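One way to examine pre-indexed items statistically is to score each pairing of an ordinary language word with a metadata term from a 2x2 contingency table of record counts. The sketch below uses the log-likelihood ratio statistic (G²) as an assumed weighting; this report does not specify the statistic, and the counts in the usage lines are hypothetical.

```python
import math


def g2(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: records containing the word and indexed with the term
    k12: records containing the word but not indexed with the term
    k21: records indexed with the term but not containing the word
    k22: records with neither
    """
    total = k11 + k12 + k21 + k22

    def contribution(observed, row_total, col_total):
        # A cell with zero observations contributes nothing to the sum.
        if observed == 0:
            return 0.0
        expected = row_total * col_total / total
        return observed * math.log(observed / expected)

    word, no_word = k11 + k12, k21 + k22
    term, no_term = k11 + k21, k12 + k22
    return 2.0 * (
        contribution(k11, word, term)
        + contribution(k12, word, no_term)
        + contribution(k21, no_word, term)
        + contribution(k22, no_word, no_term)
    )


# Hypothetical counts: a strongly associated pair and an independent pair.
print(round(g2(30, 5, 5, 60), 2))
print(round(g2(10, 10, 10, 10), 2))
```

The statistic is zero when word and term occur independently and grows as the observed counts depart from independence in either direction, so a ranking would also need to check that the pair co-occurs more often than expected, not less.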

Future Work

How to define useful subject domains

Our preliminary explorations reveal significant differences in the association patterns between the natural language and metadata vocabularies among rather arbitrarily defined subdomains within the same broader domain. This suggests that the mapping of vocabularies at the subdomain level should be helpful in easing difficulties in using unfamiliar metadata vocabularies.

Even though it seems potentially useful to provide EVMs at a more specific subdomain level than that of an entire existing database, it is not a simple task to define and identify meaningful and useful subdomains within a larger, more general domain, and it is unclear how subdomains should best be defined and identified. Therefore, we plan to continue to investigate methods of identifying meaningful subdomains.

Use of the SCI Journal Impact Report for subdomain definitions

We plan to use ranked lists of journals as sources of representative data for a given subject domain. This ranking comes from the Science Citation Index Journal Impact Report and the Social Science Citation Index Journal Impact Report. Each year, the Institute for Scientific Information publishes a report ranking journal impact by measuring the number of times each journal article was cited.

Following these principles, we have begun work on creating a larger sample of subdomain dictionaries with the documents from the most highly ranked journals in given subject fields. We will explore the issue of domain sensitivity and the usefulness of adopting the SCI journal impact factor by testing the results of these subdomain dictionaries.

If this approach continues to seem promising, we plan to go further and develop a dynamic dictionary building module that allows the user to select the subdomain. One possible approach would be for EVM agents (EVAs) to present a list of subdomains as defined by SCI and SSCI. After the user chooses one or more subdomains, the appropriate EVAs would gather a data set based on that choice by searching for records from the top ten journals in that subdomain (or subdomains). An association dictionary would then be built and presented to the user for use in searching the database.
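The planned flow can be sketched as two steps: gather the records from the top-ranked journals of the chosen subdomains, then build a dictionary from that set. Everything here is hypothetical, including the subdomain names, the journal rankings (a real system would take them from the SCI and SSCI reports), the record fields, and the simple co-occurrence counts standing in for the actual dictionary building method.

```python
from collections import Counter, defaultdict

# Hypothetical ranked journal lists per subdomain.
RANKED_JOURNALS = {
    "biophysics": ["Biophysical Journal", "Biofizika"],
    "chemical physics": ["Journal of Chemical Physics", "Chemical Physics Letters"],
}


def gather_records(records, chosen, ranked=RANKED_JOURNALS, top_n=10):
    """Select records published in the top_n journals of the chosen subdomains."""
    wanted = set()
    for subdomain in chosen:
        wanted.update(ranked.get(subdomain, [])[:top_n])
    return [record for record in records if record["journal"] in wanted]


def build_dictionary(records):
    """Count co-occurrences of title words with assigned metadata terms."""
    assoc = defaultdict(Counter)
    for record in records:
        for word in set(record["title"].lower().split()):
            assoc[word].update(record["terms"])
    return assoc


# Hypothetical records for illustration.
RECORDS = [
    {"journal": "Biophysical Journal", "title": "Water transport across membranes",
     "terms": ["water", "biomembrane transport"]},
    {"journal": "Journal of Chemical Physics", "title": "Liquid water structure",
     "terms": ["liquid structure", "water"]},
    {"journal": "Nuclear Technology", "title": "Light water reactor fuel",
     "terms": ["fission reactor fuel"]},
]

selected = gather_records(RECORDS, ["biophysics"])
dictionary = build_dictionary(selected)
print(dictionary["water"].most_common())
```

Because the dictionary is built only from the selected records, the query word "water" is associated here only with terms assigned in the chosen subdomain.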

References

Grishman, R. and R. Kittredge. (1986). Analyzing language in restricted domains: sublanguage description and processing. Lawrence Erlbaum Associates, Hillsdale, N.J.; London.

Norgard, B. (1998). Entry Vocabulary Modules and Agents. Technical report.

Plaunt, C. and B. A. Norgard (1998). An association based method for automatic indexing with a controlled vocabulary. Journal of the American Society for Information Science.