Translingual Information Management using Domain
Ontologies
Statement of Work (* indicates amendments
of Nov 16, 1999)
This project, "Translingual Information Management using Domain
Ontologies," will adopt, demonstrate, and assess a new, innovative,
and cost-effective approach to handling text in unfamiliar foreign
languages.
The very large and continuing investment in the creation of online
bibliographies and digital libraries has resulted in a body of tens
of millions of textual records in all languages. Each record is
carefully categorized by topic using a variety of widely-used systems
for the organization of recorded knowledge -- indexing languages,
library classifications, and topical thesauri -- known collectively
as "domain ontologies." This large and rapidly growing infrastructure
is readily accessible online.
This vast infrastructure, carefully maintained in accordance with
well-established, internationally accepted, and increasingly interoperable
standards and protocols, can be viewed as a corpus of carefully coded
language fragments -- titles and sometimes summaries and even the
full text of documents.
This project will demonstrate how these language fragments can be
extracted and manipulated, using DARPA-funded technology, to:
Base Project:
1. Create topical dictionaries showing the topic(s) associated with
each word in any selected language;
2. Extend the range and scale of these dictionaries using any conventional
bilingual or multilingual dictionaries;
3. Use bilingual parallel texts where available to extend the range
and scale of topical dictionaries;
4. Develop the technology necessary for rapid extraction and deployment
of the data that are available;
5. Collect corpora in digital form of contemporary discourse in
little-documented languages of remote places using non-Roman scripts,
with preference given to local newspaper accounts of current economic,
social and political issues.
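
As an illustration of tasks 1 and 2 only, the following minimal Python
sketch derives a topical dictionary from topic-coded records and then
extends it through a bilingual word list. It is not the project's
software: the sample records, the topic codes "AGRI" and "ECON", the
Spanish word list, and all function names are hypothetical, and simple
co-occurrence counting stands in for whatever association method is
actually used.

```python
from collections import Counter, defaultdict

# Hypothetical topic-coded records: (title, topic codes). The codes are
# placeholders, not entries from any real classification scheme.
RECORDS = [
    ("rice harvest and irrigation", ["AGRI"]),
    ("irrigation canal funding", ["AGRI", "ECON"]),
    ("bank funding policy", ["ECON"]),
]

# Hypothetical bilingual word list (Spanish to English).
BILINGUAL = {"riego": ["irrigation"], "banco": ["bank"]}


def build_topical_dictionary(records):
    """Map each word to a Counter of the topic codes it co-occurs with."""
    word_topics = defaultdict(Counter)
    for title, topics in records:
        # Count each word once per record, however often it repeats.
        for word in set(title.lower().split()):
            for topic in topics:
                word_topics[word][topic] += 1
    return word_topics


def extend_with_bilingual(word_topics, bilingual):
    """Give each source-language word the topic counts of its translations."""
    extended = defaultdict(Counter)
    for source_word, translations in bilingual.items():
        for translation in translations:
            extended[source_word].update(word_topics.get(translation, Counter()))
    return extended


def top_topics(word_topics, word, n=3):
    """Return up to n topic codes most often associated with a word."""
    counts = word_topics.get(word.lower(), Counter())
    return [topic for topic, _ in counts.most_common(n)]


TOPICAL = build_topical_dictionary(RECORDS)
TOPICAL_ES = extend_with_bilingual(TOPICAL, BILINGUAL)
```

Calling top_topics(TOPICAL, "irrigation") ranks the topic codes seen
with that word, and top_topics(TOPICAL_ES, "riego") gives the same
ranking for the Spanish word by way of its translation.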
*Option B (if funded):
6. Categorize texts (fragments, documents, sets of documents) in any
language by topic, using standard schemes for knowledge organization.
*Option C (if funded):
7. Demonstrate how this use of existing standards for categorization
can lead not only from the word to its topic but also to the
world's literature on that topic and location.
*Option D (if funded):
8. Handle proper nouns -- personal names, institution names, and
place names -- as well as ordinary nouns and verbs;
9. Identify ambiguous place names by determining from clues in the
adjacent text the most probable geographical coordinates;
10. Extend the search for background information on the topic and
location of interest to include data from geo-referenced numeric
data sets.
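
As an illustration of task 9 only, the following minimal Python sketch
picks the most probable coordinates for an ambiguous place name by
matching words in the adjacent text against clue words stored with each
candidate. It is not the project's method: the gazetteer, its clue
words, the function name, and the overlap-count scoring are all
hypothetical, and the coordinates are rounded approximations.

```python
import re

# Hypothetical gazetteer: each ambiguous name lists candidate places with
# approximate coordinates and a few clue words chosen for illustration.
GAZETTEER = {
    "cambridge": [
        {"region": "England", "lat": 52.21, "lon": 0.12,
         "clues": {"england", "fens", "anglia"}},
        {"region": "Massachusetts", "lat": 42.37, "lon": -71.11,
         "clues": {"massachusetts", "boston", "charles"}},
    ],
}


def disambiguate(name, context):
    """Return the candidate whose clue words best match the context.

    Returns None when the name is unknown or no clue word appears.
    """
    candidates = GAZETTEER.get(name.lower(), [])
    if not candidates:
        return None
    context_words = set(re.findall(r"\w+", context.lower()))
    # Score each candidate by how many of its clue words occur nearby.
    best = max(candidates, key=lambda c: len(c["clues"] & context_words))
    if not best["clues"] & context_words:
        return None
    return best
```

For example, disambiguate("Cambridge", "a bridge over the Charles near
Boston") selects the Massachusetts candidate, while a context with no
clue words returns None rather than guessing.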
*Option E (if funded):
Develop font intermediary software for multiple font web page representations.
(Previously part of Task A.3).
The proposed technology would complement existing language engineering
techniques. Since the technology to be developed draws on a very
large existing and ongoing investment for other purposes, substantial
improvements in versatility, speed of deployment, and cost-reduction
are expected. The project includes enabling research on the representation
of non-Roman scripts where font representations have not been standardized,
on the use of emerging standards, and on the use of specialized vocabulary
in narrowly technical texts.
Fourth and Fifth Year Options:
A fourth year continuation would focus on performance assessment,
scalability, and adaptability to additional challenges. A fifth
year would concentrate on technology transfer, deployment in diverse
task situations, and integration into the evolving digital library
environment.
All software and data corpora will be made freely available, subject
only to acknowledgment. The contractor's staffing requirements are
reflected in the budget, which also includes some provision for computing
equipment.