June 28, 1998, Revised July 14, 1998, December 12, 1998
Acknowledgement: The work reported here was supported by the Defense Advanced Research Projects Agency through DARPA Contract N66001-97-C-8541; AO# F477: Search Support for Unfamiliar Metadata Vocabularies.
Searchers need to select information from a large population of heterogeneous repositories with quite diverse metadata vocabularies (i.e., approaches to organizing information that employ different categorization, classification, and indexing semantics). It therefore seems beneficial to provide mappings between ordinary language and these metadata vocabularies.
Our Entry Vocabulary Modules are designed to accept statements in the searcher's terms and respond with a ranked list of terms from the system vocabulary to help the searcher to deal with unfamiliar metadata.
In this project we develop association dictionaries that map ordinary language terms to the metadata vocabularies of highly-used databases, such as BIOSIS and INSPEC. We base these association dictionaries on training data from subsets of existing documents in these databases.
Details on the methods and techniques used to build Entry Vocabulary Modules are discussed elsewhere (Plaunt 1998, Norgard 1998).
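As a rough illustration of the idea only, the sketch below builds a toy association dictionary from pre-indexed records and returns a ranked list of metadata terms for a query. It is not the method of the cited reports: the record data, the function names (`tokenize`, `build_association_dictionary`, `lookup`), and the simple co-occurrence weighting are all assumptions made for this example.

```python
from collections import Counter, defaultdict
import math
import re


def tokenize(text):
    """Lowercase a title or abstract and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_association_dictionary(records):
    """Count, per record, how often each word co-occurs with each metadata term.

    `records` is an iterable of (text, metadata_terms) pairs (hypothetical format).
    """
    cooccur = defaultdict(Counter)  # word -> metadata term -> number of records
    term_freq = Counter()           # metadata term -> number of records
    n_records = 0
    for text, terms in records:
        n_records += 1
        term_freq.update(set(terms))
        for word in set(tokenize(text)):
            cooccur[word].update(set(terms))
    return cooccur, term_freq, n_records


def lookup(dictionary, query, k=10):
    """Return up to k metadata terms ranked by association with the query words."""
    cooccur, term_freq, n_records = dictionary
    scores = Counter()
    for word in tokenize(query):
        for term, joint in cooccur.get(word, {}).items():
            # Assumed weighting: joint count, damped for terms common everywhere.
            scores[term] += joint * math.log(1 + n_records / term_freq[term])
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Invented training records, for illustration only.
records = [
    ("Water pollution in river basins", ["water pollution", "rivers"]),
    ("Modelling air pollution dispersion", ["air pollution", "geophysics computing"]),
    ("Pollution monitoring of coastal water", ["water pollution", "oceanography"]),
    ("Laser ranging of cloud layers", ["lasers", "geophysics computing"]),
]
dictionary = build_association_dictionary(records)
print(lookup(dictionary, "pollution")[:1])  # ['water pollution']
```

Any association statistic could replace the weighting inside `lookup`; the structure (train on indexed records, then rank metadata terms for a searcher's words) is the part that mirrors the text.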
In developing actual applications with this potential benefit, the next issue in design that confronts us is: What level of domain definition is optimal for mapping ordinary language terms to metadata vocabulary terms?
Ordinarily, metadata vocabularies are studied as a whole, and our vocabulary mapping can be designed for a whole database with its own metadata vocabulary. In practice, however, users are rarely equally interested in all of the contents of a database. They are usually interested in some specific subdomain reflecting their particular interest. Hence we have concentrated on Entry Vocabulary Modules that reflect topical, task-oriented subdomains. With these assumptions in mind, we examine a potentially more useful level of mapping with respect to subdomains (i.e., subject domains narrower than that covered by an entire database).
In the following sections, we show the mapping results of our association methods for two different metadata vocabularies. We also discuss a preliminary exploration of the sensitivity of association patterns to the subdomains within a single database.
The results of searching two association dictionaries, one for BIOSIS and one for INSPEC, are shown below. Each list gives the metadata vocabulary terms most highly associated with the submitted query. We used the same term, "pollution", for both searches. We refer to the idiosyncratic language used in a subdomain as a sublanguage (Grishman 1986).
Rank | BIOSIS: Top 10 Metadata Terms associated with "pollution" | INSPEC: Top 10 Metadata Terms associated with "pollution" |
---|---|---|
1 | public health environmental health air water and soil pollution | water pollution |
2 | food and industrial microbiology biodegradation and biodeterioration | geophysics computing |
3 | toxicology environmental and industrial toxicology | dynamic programming |
4 | ecology environmental biology oceanography and limnology | query languages |
5 | general biology institutions administration and legislation | lasers |
6 | ecology environmental biology oceanography | software packages |
7 | animal production general methods | process control |
8 | plant physiology biochemistry and biophysics water relations | microcomputers |
9 | physiology and biochemistry of bacteria | fluid mechanics |
10 | public health disease vectors inanimate | information retrieval |
These two result sets demonstrate obvious heterogeneity among the metadata vocabularies and suggest the consequent advantages of providing vocabulary mapping devices. By presenting these lists of metadata vocabulary terms in place of conventional keyword search results, which might be overwhelmingly large and contain a significant proportion of non-relevant items, we give users the opportunity to do more effective subject searching. Furthermore, by providing a navigation facility through the structure of the metadata system (in our case, navigation of the INSPEC thesaurus), we give users the chance to understand the organization of the metadata vocabulary.
To build on this advantage, we wish to explore the degree of domain definition that is most useful in actual information searching situations. As mentioned earlier, we think that a mapping facility for more restricted subject domains might be more useful, because BIOSIS and INSPEC are each already quite broad in their subject coverage.
Creating subject subdomains
As a preliminary exploration, we have defined subject subdomains within the existing databases. For example, from the same INSPEC database, we selected two sets of data by doing title keyword searches with the terms "water" and "bio#" (where # is a truncation indicator). By proceeding in this manner, we collected two sets of documents that each approximately represent a certain subject subdomain within the broader subject areas covered by INSPEC: water management studies as one, and bio-engineering and biophysics as the other.
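The selection step can be sketched as a title filter in which a trailing `#` matches any word ending. This is a minimal sketch under assumptions: the record format, the function names, and the whole-word matching rule are hypothetical and may not match how the INSPEC search interface treats truncation.

```python
import re


def truncation_to_regex(pattern, marker="#"):
    """Compile a keyword, with an optional trailing truncation marker, to a regex.

    "bio#" matches any word starting with "bio"; "water" matches that word only.
    """
    if pattern.endswith(marker):
        body = re.escape(pattern[: -len(marker)]) + r"\w*"
    else:
        body = re.escape(pattern)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


def select_subdomain(records, pattern):
    """Return the records whose title contains a word matching the pattern."""
    rx = truncation_to_regex(pattern)
    return [r for r in records if rx.search(r["title"])]


# Invented records, for illustration only.
records = [
    {"title": "Biomechanics of bone"},
    {"title": "Water supply networks"},
    {"title": "Symbiosis in algae"},
    {"title": "Heavy water reactors"},
]
bio_set = select_subdomain(records, "bio#")
water_set = select_subdomain(records, "water")
print(len(bio_set), len(water_set))  # 1 2
```

Anchoring the match at a word boundary keeps "Symbiosis" out of the "bio#" set; a looser substring match would be an equally defensible choice for a first pass.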
Inspection of the journal titles that were collected within each group indicated that this assumption was largely valid. The following are the lists of the twenty most frequently occurring journal titles in each group of data selected by the searches on "water" and "bio#", respectively. Figure 2 shows the journal titles for the "water" subdomain and Figure 3 shows the journal titles for the "bio" subdomain. Only one journal (Biophysical Journal) covers both topics; otherwise, they appear to be quite distinct topic areas or subdomains.
Figure 2. Journal titles for the INSPEC "water" subdomain | |
---|---|
1 | Journal of Chemical Physics |
2 | Journal of Geophysical Research |
3 | Chemical Physics Letters |
4 | Journal of Physical Chemistry |
5 | Nuclear Technology |
6 | Transactions of the American Nuclear Society |
7 | Nuclear Engineering and Design |
8 | Journal of the Acoustical Society of America |
9 | Proceedings of the 1994 International Topical Meeting on Light Water Reactor |
10 | Proceedings of the U.S. Nuclear Regulatory Commission |
Figure 3. Journal titles for the INSPEC "bio#" subdomain | |
---|---|
1 | Biophysical Journal |
2 | Medical & Biological Engineering & Computing |
3 | Physics in Medicine and Biology |
4 | Journal of Biomechanics |
5 | IEEE Transactions on Biomedical Engineering |
6 | International Journal of Radiation Oncology Biology Physics |
7 | International Journal of Radiation Biology |
8 | Biofizika |
9 | Biological Cybernetics |
10 | Computer Methods and Programs in Biomedicine |
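Journal tallies of the kind shown in Figures 2 and 3 amount to a frequency count over the journal field of each selected record set. The sketch below assumes a hypothetical record format with a `journal` key; the sample data is invented.

```python
from collections import Counter


def top_journals(records, k=20):
    """Return the k most frequent journal titles with their record counts."""
    counts = Counter(r["journal"] for r in records if r.get("journal"))
    return counts.most_common(k)


# Invented sample of a selected record set, for illustration only.
sample = [
    {"journal": "Biophysical Journal"},
    {"journal": "Journal of Biomechanics"},
    {"journal": "Biophysical Journal"},
    {"journal": "Biofizika"},
    {"journal": "Journal of Biomechanics"},
    {"journal": "Biophysical Journal"},
]
print(top_journals(sample, k=2))
# [('Biophysical Journal', 3), ('Journal of Biomechanics', 2)]
```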
We created two subdomain association dictionaries with these two sets of data from INSPEC. As a test to examine the subdomain sensitivity of our vocabulary mapping method, we submitted the same query, "water", to retrieve the most likely metadata vocabulary terms. The results of this search confirmed our expectation that these two groups of data would be distinguishably different in their use of language.
Figure 4 shows the top ten metadata terms associated with the ordinary language term "water" in each of the two dictionaries (the INSPEC "water" dictionary and the INSPEC "bio" dictionary). The two result sets are very different.
Figure 4. Top 10 INSPEC Thesaurus Terms returned for the query "water" | ||
---|---|---|
Rank | From the INSPEC-based "water" subdomain dictionary | From the INSPEC-based "bio" subdomain dictionary |
1 | fission reactor fuel | water |
2 | water supply | biomechanics |
3 | water | physiological models |
4 | water treatment | neurophysiology |
5 | liquid structure | cellular effects of radiation |
6 | organic insulating materials | cardiology |
7 | accidents | muscle |
8 | fission reactor safety | blood |
9 | polymers | bone |
10 | fission reactor materials | biomedical ultrasonics |
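One simple way to quantify how different the two ranked lists are is to measure their overlap. The sketch below uses the terms from the table above; the `overlap` helper and the choice of the Jaccard coefficient are assumptions made for illustration, not a measure used in this report.

```python
# Top ten terms for the query "water", copied from the table above.
water_dict_terms = [
    "fission reactor fuel", "water supply", "water", "water treatment",
    "liquid structure", "organic insulating materials", "accidents",
    "fission reactor safety", "polymers", "fission reactor materials",
]
bio_dict_terms = [
    "water", "biomechanics", "physiological models", "neurophysiology",
    "cellular effects of radiation", "cardiology", "muscle", "blood",
    "bone", "biomedical ultrasonics",
]


def overlap(a, b):
    """Return the shared terms and the Jaccard coefficient of two term lists."""
    set_a, set_b = set(a), set(b)
    shared = set_a & set_b
    return shared, len(shared) / len(set_a | set_b)


shared, jaccard = overlap(water_dict_terms, bio_dict_terms)
print(sorted(shared))     # ['water']
print(round(jaccard, 3))  # 0.053
```

Only the term "water" itself is shared (rank 3 in one list, rank 1 in the other), so the two dictionaries agree on one of nineteen distinct terms.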
The vocabulary differences between heterogeneous metadata systems are obvious and clearly expected, but this kind of difference between subdomains within the same database, using the same metadata system, is an interesting and potentially useful finding.
It shows that different subject domains have different patterns of associations between the ordinary language terms that appear in both the titles and abstracts and the metadata vocabulary terms assigned to the records in which they occur. That is, depending on the subject subdomain in INSPEC, the same ordinary language term "water" could be said to have been used with different senses in different contexts and was therefore associated with different metadata vocabulary terms.
These association patterns are relatively reliable because they involve subject indexing by human indexers, where subject expertise is presumed and an understanding of the document topics is expected. This method can be said to utilize the human judgement embedded in the association patterns that are captured.
These preliminary results show the sensitivity of our mapping method to subdomains defined within existing databases and the consequent usefulness of providing subdomain levels of vocabulary mapping.
If a user were able to specify a subdomain of interest and have the EVM component of our system create an association dictionary based on this specification, it would then be possible to submit an array of queries in this area of interest to the dictionary with much more satisfactory results. In effect, subdomain EVMs add topic focus.
Metadata vocabularies already provide a mapping function by representing more than one indexing term for each record in the database. This is based on human knowledge and judgment. One problem for the user is that this mapping is usually rather hidden so that the user must assume the burden of guessing which ordinary language terms could possibly be mapped to which metadata vocabulary terms.
This approach attempts to ease this burden by tracing the process of human indexing backward. We try to find the patterns in how ordinary language terms are indexed with metadata vocabulary terms in a given domain by statistically examining the pre-indexed items in that domain. We then provide a mapping based on that relationship. As we have shown, if mapping between the two languages is useful at all, mapping at the subdomain level should be still more useful.
Even though it seems potentially useful to provide EVMs at a more specific subdomain level than that of an entire existing database, it is not a simple task to define and identify meaningful and useful subdomains within a larger, more general domain, and it remains unclear how this should best be done. Therefore, we plan to continue to investigate methods of identifying meaningful subdomains.
Following these principles, we have begun work on creating a larger sample of subdomain dictionaries with the documents from the most highly ranked journals in given subject fields. We will explore the issue of domain sensitivity and the usefulness of adopting the SCI journal impact factor by testing the results of these subdomain dictionaries.
If this approach continues to seem promising, we plan to go further and develop a dynamic dictionary building module that allows the user to select the subdomain. One possible approach would be for EVM agents (EVAs) to present a list of subdomains as defined by SCI and SSCI. After the user chooses one or more subdomains, the appropriate EVAs would gather a data set based on that choice by searching for records from the top ten journals in that subdomain (or subdomains). An association dictionary would then be built and presented to the user for use in searching the database.
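The planned workflow might be sketched as follows, purely as an illustration. The `journal_lists` mapping stands in for per-subdomain journal rankings, and the record format, function names, and raw co-occurrence counting are hypothetical choices made for this example, not a description of an existing module.

```python
from collections import Counter, defaultdict


def gather_records(records, journals):
    """Keep only the records published in one of the given journals."""
    wanted = {j.lower() for j in journals}
    return [r for r in records if r["journal"].lower() in wanted]


def build_dictionary(records):
    """Map each title word to counts of the metadata terms it co-occurs with."""
    assoc = defaultdict(Counter)
    for r in records:
        for word in set(r["title"].lower().split()):
            assoc[word].update(set(r["terms"]))
    return assoc


def dictionary_for(subdomains, journal_lists, records):
    """Build a dictionary from the journals listed for the chosen subdomains."""
    journals = [j for s in subdomains for j in journal_lists[s]]
    return build_dictionary(gather_records(records, journals))


# Invented subdomain lists and records, for illustration only.
journal_lists = {
    "biophysics": ["Biophysical Journal"],
    "nuclear engineering": ["Nuclear Technology"],
}
records = [
    {"journal": "Biophysical Journal", "title": "Water transport in membranes",
     "terms": ["biomembrane transport", "water"]},
    {"journal": "Nuclear Technology", "title": "Light water reactor fuel",
     "terms": ["fission reactor fuel"]},
    {"journal": "Biophysical Journal", "title": "Water in muscle fibres",
     "terms": ["muscle", "water"]},
]
evm = dictionary_for(["biophysics"], journal_lists, records)
print(evm["water"].most_common(1))  # [('water', 2)]
```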
Norgard, B. (1998). Entry Vocabulary Modules and Agents. Technical report.
Plaunt, C. and B. A. Norgard (1998). An association based method for automatic indexing with a controlled vocabulary. Journal of the American Society for Information Science.