Metadata Research Program: TIDES Project Statement of Work

Statement of Work (* indicates amendments of Nov 16, 1999)

This project, "Translingual Information Management using Domain Ontologies," will adopt, demonstrate, and assess a new, innovative, and cost-effective approach to handling text in unfamiliar foreign languages.

The very large and continuing investment in the creation of online bibliographies and digital libraries has resulted in a body of tens of millions of textual records in all languages. Each record is carefully categorized by topic using a variety of widely used systems for the organization of recorded knowledge -- indexing languages, library classifications, and topical thesauri -- known collectively as "domain ontologies." This large and rapidly growing infrastructure is readily accessible online.

This vast infrastructure, carefully maintained in accordance with well-established, internationally accepted, and increasingly interoperable standards and protocols, can be viewed as a corpus of carefully coded language fragments: titles, sometimes summaries, and even the full text of documents.

This project will demonstrate how these language fragments can be extracted and manipulated, using DARPA-funded technology, to:

Base Project:
1. Create topical dictionaries showing the topic(s) associated with each word in any selected language;
2. Extend the range and scale of these dictionaries using any conventional bilingual or multilingual dictionaries;
3. Use bilingual parallel texts where available to extend the range and scale of topical dictionaries;
4. Develop the technology necessary for rapid extraction and deployment of the data that are available;
5. Collect corpora in digital form of contemporary discourse in little-documented languages of remote places using non-Roman scripts, with preference given to local newspaper accounts of current economic, social and political issues.
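
Tasks 1 and 2 above can be illustrated with a minimal sketch. It assumes that each record pairs a short text (such as a title) with a single topic label, and that a bilingual dictionary is available as a plain word-to-words mapping. The sample records, the topic labels, and the function names are hypothetical and are not taken from the project's software.

```python
from collections import Counter, defaultdict

# Hypothetical sample records: (title, topic label). Real input would be
# bibliographic records carrying classification or thesaurus codes.
RECORDS = [
    ("water supply in rural areas", "Hydrology"),
    ("rural water policy", "Public policy"),
    ("irrigation and water rights", "Hydrology"),
]


def build_topical_dictionary(records):
    """Map each word to a Counter of the topics of the records it occurs in."""
    word_topics = defaultdict(Counter)
    for title, topic in records:
        # Count each word once per record, however often it repeats.
        for word in set(title.lower().split()):
            word_topics[word][topic] += 1
    return word_topics


def extend_with_bilingual(word_topics, bilingual):
    """Carry topic counts from source words to their listed translations."""
    extended = defaultdict(Counter)
    for source, targets in bilingual.items():
        if source not in word_topics:
            continue  # no topical evidence for this source word
        for target in targets:
            extended[target].update(word_topics[source])
    return extended


topical = build_topical_dictionary(RECORDS)
# Assumed bilingual entry, for illustration only.
spanish = extend_with_bilingual(topical, {"water": ["agua"]})
print(topical["water"].most_common())
print(spanish["agua"].most_common())
```

A real dictionary would need stop-word handling, morphological normalization, and weighting by topic frequency; the sketch shows only the word-to-topic association itself.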

*Option B (if funded):
6. Categorize texts (fragments, documents, sets of documents) in any language by topic, using standard schemes for knowledge organization.

*Option C (if funded):
7. Demonstrate how this use of existing standards for categorization can lead not only from the word to its topic but also to the world's literature on that topic and location.

*Option D (if funded):
8. Handle proper nouns (personal names, institutions' names, and place names) as well as ordinary nouns and verbs;
9. Identify ambiguous place names by determining, from clues in the adjacent text, the most probable geographical coordinates;
10. Extend the search for background information on the topic and location of interest to include data from geo-referenced numeric data sets.
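
Task 9 can be sketched as a lookup in a gazetteer whose entries carry clue words, with the candidate that shares the most clue words with the adjacent text being chosen. The gazetteer entries, the rounded coordinates, and the `resolve_place` function below are illustrative assumptions; counting shared words is a crude stand-in for a proper probability estimate.

```python
import re

# Hypothetical gazetteer: name -> candidate places, each with approximate
# (latitude, longitude) and clue words. Values are illustrative only.
GAZETTEER = {
    "cambridge": [
        {"coords": (52.21, 0.12), "clues": {"england", "fens", "cam"}},
        {"coords": (42.37, -71.11), "clues": {"massachusetts", "boston"}},
    ],
}


def resolve_place(name, context, gazetteer):
    """Pick the candidate whose clue words overlap most with the context."""
    candidates = gazetteer.get(name.lower(), [])
    if not candidates:
        return None
    words = set(re.findall(r"\w+", context.lower()))
    # On a tie, max() keeps the first candidate listed in the gazetteer.
    best = max(candidates, key=lambda c: len(c["clues"] & words))
    return best["coords"]


print(resolve_place("Cambridge", "Floods near Cambridge closed roads into Boston.", GAZETTEER))
```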

*Option E (if funded):
Develop font intermediary software for multiple font web page representations. (Previously part of Task A.3).

The proposed technology would complement existing language engineering techniques. Since the technology to be developed draws on a very large existing and ongoing investment made for other purposes, substantial improvements in versatility, speed of deployment, and cost reduction are expected. The project includes enabling research on the representation of non-Roman scripts where font representations have not been standardized, on the use of emerging standards, and on the use of specialized vocabulary in narrowly technical texts.

Fourth and Fifth Year Options:
A fourth-year continuation would focus on performance assessment, scalability, and adaptability to additional challenges. A fifth year would concentrate on technology transfer, deployment in diverse task situations, and integration into the evolving digital library environment.

All software and data corpora will be made freely available, subject only to acknowledgment. The contractor's staffing requirements are reflected in the budget, which also includes some provision for computing equipment.

Last Update: 11-Jul-2001