June 28, 1998, Revised July 14, 1998
Acknowledgement: The work reported here was supported by Defense Advanced Research Projects Agency through DARPA Contract N66001-97-C-8541; AO# F477: Search Support for Unfamiliar Metadata Vocabularies.
Association dictionaries are created by linking ordinary language terms (words and noun phrases) to controlled vocabulary terms which are then ranked by co-occurrence frequencies. We use ordinary language terms that occur in titles and abstracts from bibliographic and other types of records. These records represent a particular domain of discourse. (How this domain is defined is discussed elsewhere (Kim 1998).) The controlled vocabulary terms we use are the indexing terms used by any of a number of MELVYL databases that cover general topic areas (e.g., BIOSIS and INSPEC) or other publicly available databases like the U.S. Patents databases. (MELVYL is the University of California's online library system. The MELVYL system includes a library catalog database, a periodicals database, article citation databases, and other files.)
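The linking step described above can be illustrated with a minimal sketch. It assumes each record has already been reduced to a pair of free text (title plus abstract) and a list of controlled vocabulary terms, it only handles single words rather than noun phrases, and it uses raw co-occurrence counts as the ranking score. The function names and sample records are hypothetical and are not taken from the project's code.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_association_dictionary(records):
    """Map each ordinary-language word to a Counter of controlled terms.

    `records` is an iterable of (free_text, controlled_terms) pairs.
    """
    assoc = defaultdict(Counter)
    for free_text, controlled_terms in records:
        # Count each word once per record so long abstracts do not dominate.
        for word in set(tokenize(free_text)):
            for term in controlled_terms:
                assoc[word][term] += 1
    return assoc


# Hypothetical sample records, for illustration only.
records = [
    ("Parsing with neural networks",
     ["computational linguistics", "neural nets"]),
    ("Statistical parsing of English text",
     ["computational linguistics"]),
    ("Neural networks for image recognition",
     ["neural nets", "image processing"]),
]

assoc = build_association_dictionary(records)
# Controlled terms linked to "parsing", most frequent first.
ranked = assoc["parsing"].most_common()
```

A production version would likely replace the raw counts with a statistical association measure and would extract noun phrases as well as single words.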
However, there is nothing to prevent a casual user from defining their domain in other ways and we do not see any need to restrict them. So, we can imagine using the topic term to search on the title, abstract, or subject heading fields to form a data set.
Our strategy for selecting journal titles is currently a topic of investigation. We are making use of the Science Citation Index (SCI) and the Social Science Citation Index (SSCI) to identify the most frequently cited (and therefore frequently used) journals in a particular domain. (This is discussed in another report on the work of this project (Kim 1998).)
Theoretically, large sets can be handled effectively, but a rapid response time is crucial in a dynamic environment. When EVMs are created in real time, the time it takes to download a set and process it becomes a significant factor in determining optimal set size. A user can be expected to allow some time for a dictionary to be custom built for his or her needs, but even the patience of an understanding user has limits. At this point, we can only guess how long someone is willing to wait. Determining the minimal size necessary to adequately cover the sublanguage of a domain is a question yet to be answered with any confidence. Since this prototype is envisioned serving as a desktop utility, methods of achieving quick responses without sacrificing comprehensive coverage should be given high priority.
Currently, we conduct these record downloads by starting the script utility at the Unix prompt. (The script program is a common Unix utility used for recording interactions with a computer.) Then a telnet session is initiated with MELVYL and a database is selected (e.g., INSPEC or BIOSIS). A set of records is identified that represents a recognized domain of discourse. This record set must satisfy our size criteria for a ``good set'' and represent the discourse generated by some shared activity that can be defined by a query on the journal title field.
Then, with script still running, we issue a request to MELVYL for a continuous display of the records with the required tags. When the record display ends, the script process is terminated, thereby capturing all the records in a single file (named typescript). This file is then ready to be processed and transformed into an association dictionary.
Adding data retrieval agent functionality
This is a point at which agent functionality in the form of communicating with other systems comes into play. We call this the data retrieval agent.
Communicating with other internet systems
Parts of this data set gathering process can be fully automated with an Expect-like function that establishes communication with MELVYL, issues the query, and collects the records in a file. For other aspects of the OASIS project, we have used a Perl version of Expect (called Comm.pl) to query MELVYL and download records with satisfactory results.
Expect is written in Tcl (a scripting language). Tcl has proven unsatisfactory for flexibly processing large amounts of data. For instance, large arrays overwhelm Tcl, while Perl handles them with ease. There are workarounds, of course, but Perl presents no such barriers. Therefore, we do not use the Tcl version of Expect. The Perl version has presented no problems handling large data sets.
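The Expect-like pattern described above (wait for a prompt, send a command, collect the reply) can be sketched as follows. This is an illustration in Python rather than the Perl Comm.pl code the project used. The prompt pattern, the command strings, and the FakeTransport class that stands in for a telnet connection are all hypothetical placeholders and do not reflect actual MELVYL syntax.

```python
import os
import re
import tempfile

PROMPT = r"->\s*"  # hypothetical prompt pattern, not MELVYL's real prompt


class ExpectSession:
    """Minimal send/expect loop over any object with read() and write()."""

    def __init__(self, transport):
        self.transport = transport
        self.buffer = ""

    def send(self, line):
        self.transport.write(line + "\n")

    def expect(self, pattern, max_reads=1000):
        """Read until `pattern` matches; return the text before the match."""
        regex = re.compile(pattern)
        reads = 0
        while True:
            match = regex.search(self.buffer)
            if match:
                before = self.buffer[:match.start()]
                self.buffer = self.buffer[match.end():]
                return before
            if reads >= max_reads:
                raise TimeoutError(f"pattern {pattern!r} not seen")
            chunk = self.transport.read()
            reads += 1
            if not chunk:
                raise TimeoutError(f"connection ended before {pattern!r}")
            self.buffer += chunk


class FakeTransport:
    """Stand-in for a telnet connection that replays canned replies."""

    def __init__(self, replies):
        self.replies = list(replies)
        self.sent = []

    def write(self, text):
        self.sent.append(text)

    def read(self):
        return self.replies.pop(0) if self.replies else ""


def download_records(session, database, query, out_path):
    """Select a database, run a query, and save the displayed records."""
    session.expect(PROMPT)
    session.send(f"use {database}")          # placeholder command text
    session.expect(PROMPT)
    session.send(query)                      # placeholder query text
    session.expect(PROMPT)
    session.send("display all continuous")   # placeholder command text
    records = session.expect(PROMPT)
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(records)
    return records


transport = FakeTransport([
    "Welcome\n-> ", "ok\n-> ", "2 records found\n-> ", "REC 1\nREC 2\n-> ",
])
session = ExpectSession(transport)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.txt")
    text = download_records(session, "INSPEC", "find journal example", path)
    with open(path, encoding="utf-8") as handle:
        saved = handle.read()
```

Swapping FakeTransport for a thin wrapper around a real network connection would leave the send/expect logic unchanged, which is the main appeal of this style of automation.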
Why not Z39.50?
We have considered using Z39.50, which is perhaps theoretically more elegant, but it has some drawbacks, one of which is the limit on the number of records returned per transaction. This limit can be worked around, but we have not pursued the adaptation necessary to incorporate this approach.
Another significant drawback to relying on Z39.50 is the paucity of sites that both comply with the standard and seem suited to our research interests. Until the advantages of Z39.50 become more compelling, we will focus on other issues.
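One way the per-transaction record limit mentioned above might be worked around is to request the result set in successive batches. The sketch below is not an adaptation the project carried out. The fetch_batch callable is a hypothetical stand-in for whatever client call retrieves a slice of a result set, and the 1-based start position is an assumption of this example.

```python
def fetch_all(fetch_batch, total, batch_size):
    """Collect `total` records by asking for at most `batch_size` at a time.

    `fetch_batch(start, count)` is a hypothetical callable that returns a
    list of up to `count` records beginning at 1-based position `start`.
    """
    records = []
    start = 1
    while start <= total:
        count = min(batch_size, total - start + 1)
        batch = fetch_batch(start, count)
        if not batch:
            break  # the server returned nothing; stop rather than loop forever
        records.extend(batch)
        start += len(batch)
    return records


# Simulated result set and server, for illustration only.
result_set = [f"record {i}" for i in range(1, 26)]
calls = []


def fake_fetch(start, count):
    calls.append((start, count))
    return result_set[start - 1:start - 1 + count]


fetched = fetch_all(fake_fetch, total=len(result_set), batch_size=10)
```

With 25 records and a batch size of 10, the loop issues three requests, the last one for the remaining 5 records.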
See Kim and Norgard (1998) for a more detailed discussion of this process using natural language processing techniques.
This could be as simple as automatically adding a link to a web page for a new association dictionary. Following that link would trigger an interactive search session against the association dictionary named by the link.
Association dictionaries can be searched through a web-browser forms interface with ordinary language queries. The searcher is presented with a ranked list of the most likely controlled vocabulary terms to retrieve information related to a given query. Within certain limits, those controlled vocabulary terms can be used to search the appropriate database. Searching the MELVYL databases other than the library catalog is limited to users associated with the UC system and is not open to the general public, but we provide access to other information resources, such as the U.S. Patents database, to anyone.
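The lookup behind such a search can be sketched as follows, assuming the association dictionary is held as a mapping from each ordinary-language word to the counts of the controlled vocabulary terms it co-occurs with. Summing the counts across query words is one simple scoring choice made for this example and is not necessarily the ranking the project uses. The function name and the sample counts are hypothetical.

```python
import re
from collections import Counter


def suggest_terms(assoc, query, limit=5):
    """Return the top controlled terms for an ordinary-language query."""
    scores = Counter()
    for word in re.findall(r"[a-z0-9]+", query.lower()):
        # Words missing from the dictionary contribute nothing.
        for term, count in assoc.get(word, {}).items():
            scores[term] += count
    return scores.most_common(limit)


# Hypothetical dictionary fragment with made-up counts.
assoc = {
    "parsing": {"computational linguistics": 12, "grammars": 7},
    "speech": {"speech recognition": 9, "computational linguistics": 4},
}

suggestions = suggest_terms(assoc, "speech parsing")
```

The returned list pairs each suggested controlled vocabulary term with its score, so a forms interface could display the terms in that order and let the searcher pick which ones to send to the target database.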
Many parts of this dictionary building process are not yet automated, but could be. The parts that are automated require more integration so that this process may proceed without intervention except when user input is necessary or desired. The user should have control over as much of the process as he or she desires. This suggests that there should be levels of control. Some users will want more control and others less.
We prefer to make this process accessible through a web interface so that it will not be limited by platform idiosyncrasies. The user will come to this application wishing to know more about a certain topic whose language is unfamiliar. We provide methods of specifying a topic area. In our initial design, this will be a list of topic areas selected from SCI and SSCI. The agent would go off and determine whether or not a data set of reasonable size can be gathered for a topic. There are other ways in which this could also proceed. Say, for instance, I want to know what the general trends in computational linguistics are these days. Searching the INSPEC database on the thesaurus term ``computational linguistics'' retrieves 2,740 records from the 1993-1998 database. The 1990-1992 database contains 1,025 records. The 1985-1989 database has 1,145 records. The 1980-1984 database returns 320 records and the 1969-1979 database finds us 313 records. Putting these five sets of records together would produce an adequate data set for building an association dictionary on the topic of ``computational linguistics''. This would not be difficult for the agents managed by an EVM to handle.
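The size check an agent would perform can be sketched with the INSPEC hit counts quoted above. The minimum set size used here is a hypothetical threshold chosen only for illustration, since the size criteria for a ``good set'' are not given in this section, and the function name is likewise an assumption.

```python
# Hit counts for the thesaurus term "computational linguistics",
# keyed by INSPEC database segment, as reported in the text.
hit_counts = {
    "1993-1998": 2740,
    "1990-1992": 1025,
    "1985-1989": 1145,
    "1980-1984": 320,
    "1969-1979": 313,
}

MIN_RECORDS = 5000  # hypothetical threshold, not a figure from the project


def plan_data_set(hit_counts, min_records):
    """Return the combined record count and whether it meets the minimum."""
    total = sum(hit_counts.values())
    return total, total >= min_records


total, adequate = plan_data_set(hit_counts, MIN_RECORDS)
print(f"{total} records across {len(hit_counts)} segments; adequate: {adequate}")
```

The combined count here is 5,543 records, which ignores any overlap between database segments.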
[Norgard 1997] Norgard, B. and Y. Kim (1997). Domains and sublanguages. Technical report.
[Plaunt forthcoming] Plaunt, C. and B. A. Norgard (forthcoming). An association based method for automatic indexing with a controlled vocabulary. Journal of the American Society for Information Science.