Evaluation Methods: EVM Dictionaries

Evaluation of the performance of the EVM dictionaries
Youngin Kim, June 30, 2000


The basic function of the EVM module is to provide the mapping from the natural terms of the user’s query to the metadata terms used in a given database, to help the user in selecting the proper metadata terms in the process of searching databases.

One way of evaluating the performance of the EVM is to measure its predictive power in retrieving the “relevant” metadata terms.  Fortunately, we have a basis for deciding which terms are “relevant”.  Each document in a database for which we build an EVM should already have metadata terms assigned; we use these terms for training, extracting associations from them to build the EVM dictionaries.

We take advantage of the existing metadata terms for the testing of performance of our technique, too, by comparing the performance of EVM to that of human indexers on the same task. The question to be answered is, "How good is the EVM dictionary at predicting the metadata terms already assigned by human indexers?"

By comparing the metadata terms suggested by our EVM dictionary with the actual terms assigned by the human indexers for the same documents, we can measure one aspect of effectiveness of EVM dictionaries. In this report we explain the evaluation method that we have adopted and show initial evaluation results based on this method.

2. Methods of Evaluation

2.1. Division of the data set

The above evaluation idea can be accomplished by using the documents as test queries against our EVM dictionaries.  We accomplish this by dividing the data set into two groups, training and testing.

When we collect the data for building an EVM dictionary, we can set aside a portion of the dataset (usually 10 to 30 percent, depending on the size of the whole collection) to be used for the evaluation. We call this portion the ‘test’ set, and we call the portion used for dictionary building the ‘training’ set.  The test documents are therefore a subset of the records in the original dataset, but they are not used in the process of dictionary building.
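The division described above can be sketched as follows. This is only an illustration: the report does not say how the test portion is chosen, so the random selection, the function name split_records, and the test_fraction and seed parameters are all assumptions.

```python
import random

def split_records(records, test_fraction=0.2, seed=0):
    """Split records into (training, test) lists that share no record.

    Random selection is an assumption; the report only states that
    10 to 30 percent of the records are set aside for testing.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical record stubs standing in for bibliographic records.
records = [{"id": i} for i in range(100)]
training, test = split_records(records, test_fraction=0.2)
```

With 100 records and a test fraction of 0.2, this yields 80 training records and 20 test records, and no record appears in both sets.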

2.2. Training and testing

We build the EVM dictionary as usual, using only the training set. Once the dictionary is prepared, we use the documents in the test set as a “query” set, treating the natural terms in the titles and abstracts of the test documents as query terms for the EVM search. We send the natural terms of each test document to the EVM dictionary search to retrieve the "most likely" metadata terms suggested by the EVM dictionary.  This set of terms is then compared to the metadata terms assigned to the same record by the human indexers.
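The train-then-query loop can be sketched with a toy dictionary. The sketch below scores associations with plain co-occurrence counts, which is a simplification chosen for illustration and not the weighting used in the actual EVM dictionaries. The record layout (title, abstract, metadata) and all function names are assumptions.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real term extraction."""
    return text.lower().split()

def build_dictionary(training_records):
    """Count how often each word co-occurs with each metadata term."""
    assoc = defaultdict(Counter)
    for rec in training_records:
        words = set(tokenize(rec["title"] + " " + rec["abstract"]))
        for word in words:
            for term in rec["metadata"]:
                assoc[word][term] += 1
    return assoc

def suggest_terms(assoc, record, k=10):
    """Rank metadata terms by summed co-occurrence counts over the record's words."""
    scores = Counter()
    for word in set(tokenize(record["title"] + " " + record["abstract"])):
        scores.update(assoc.get(word, {}))
    return [term for term, _ in scores.most_common(k)]

# Hypothetical training and test records.
training_records = [
    {"title": "dust around young stars", "abstract": "circumstellar dust grains",
     "metadata": ["cosmic dust", "circumstellar matter"]},
    {"title": "agent software design", "abstract": "object oriented agents",
     "metadata": ["software agents"]},
]
test_record = {"title": "dust grains", "abstract": "around stars",
               "metadata": ["cosmic dust"]}

suggested = suggest_terms(build_dictionary(training_records), test_record)
```

The suggested list is what gets compared with test_record["metadata"], the terms assigned by the human indexer.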

3.  Modes of evaluation

The basic method of evaluation described above can be applied to test many of the variables involved in EVM dictionary building. We classify those into four basic modes of evaluation.

3.1. General effectiveness of a single EVM dictionary

By comparing the performance of an EVM dictionary to human indexing, we can test the basic predictive power of each EVM dictionary, which in turn tests the general effectiveness of the EVM approach.

3.2. Effectiveness of different linguistic approaches

a. Comparison of the word-based dictionary with one that is phrase-based.

So far, we have developed two ways of extracting meaningful terms from the natural text in the training process.  One way is to use single words as terms to be associated with metadata terms, and the other is to apply a Natural Language Processing technique to extract noun phrases from the natural text and use those noun phrases as terms to be associated with metadata terms.  By creating two different EVM dictionaries based on these two methods with the same data set, we can comparatively evaluate the effectiveness of different linguistic approaches.
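The difference between the two kinds of terms can be illustrated with a small sketch. Real noun-phrase extraction requires an NLP tool; here contiguous word pairs are used as a crude stand-in for noun phrases, purely to show how the term sets differ. Both function names are hypothetical.

```python
def word_terms(text):
    """Single-word terms: lowercase whitespace tokens."""
    return text.lower().split()

def pair_terms(text):
    """Two-word terms: adjacent word pairs, a rough stand-in for noun phrases."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

title = "Circumstellar dust around pre-main-sequence stars"
single = word_terms(title)
paired = pair_terms(title)
```

Building one dictionary from word_terms output and another from phrase-like terms, on the same training set, gives the two dictionaries to be compared.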

b. Comparison of the results from more than one NLP technique

Also, in applying NLP techniques for extracting noun phrases, we can use more than one method separately with the same training data, building multiple EVM dictionaries, and then test those. This will provide one measure for the comparative evaluation of different NLP methods, and help us decide which NLP method is more suitable for EVM dictionary building.

3.3. Sensitivity of the subdomain EVM

Another important variable that we can test is sensitivity to the specificity of the subject domain. The idea is that for databases with very broad subject coverage, such as INSPEC, a subdomain approach would perform better in mapping the natural terms to the metadata terms. In this approach, EVM dictionaries are created for subject domains that are smaller and more specific than the whole subject range that the database covers.  This can also be tested with the same method of evaluation, by comparing the performance of EVM dictionaries with different levels of subject coverage.

3.4. Use of title and/or abstract

One important issue in the EVM dictionary building process is the question of which data elements should be used for the optimal performance of the resulting dictionary.  There are two fields of the training data from which we can extract natural language terms: title and abstract.  We could use both in the dictionary building process, or either one of them.  To make a better-informed decision on this matter, we could build different EVM dictionaries with these modes and examine their relative performance.

We could extend this issue to another aspect of EVM dictionary evaluation.  That is, in the testing process, we could use both the title and the abstract of the test documents, which are presented as queries to the EVM dictionary search, or we could use only one of them.  We could compare the results of these modes and analyze them to make the simulated test setting more reasonable, and to understand the workings of the EVM’s vocabulary mapping process.

4. Measurements of the performance

We briefly explained how we want to test the performance of EVM dictionaries in the above section. The basic idea is to examine the degree of overlap between the metadata terms suggested by the EVM dictionary and the terms assigned by human indexers.

The initial result of this examination can be presented as follows. For each of the queries (one test document in our test setting) we get the following result from our EVM search.

Metadata terms assigned by the human indexer to this specific document (columns: Number of occurrences, Human Assigned Term):

pre-main-sequence stars
cosmic dust
circumstellar matter
stellar evolution
interstellar matter

The top ten terms from the ranked list of metadata terms suggested by the EVM dictionary search (columns: Thesaurus Term; Does EVM term match human assigned term? 1=yes, 0=no):

Recall rate = 0.71 (5/7 terms matched)

software agents
interstellar matter
object-oriented programming
stellar spectra
cosmic dust
astronomical spectra
grain size
stellar evolution
circumstellar matter

We can tell, for this specific document, that five out of seven human-assigned terms were actually included in the top ten of EVM dictionary search result. 

There are many ways to convert this kind of result into measures that represent the overall performance of the application. We have developed two ways of measuring performance based on this result.

4.1. Average Recall measure

First, we measured the hit rate, which we may call the Recall rate, in the sense that it is the number of retrieved relevant terms divided by the number of assigned (and therefore relevant) terms.

In the above example, the Recall rate would be 0.71 since five out of seven relevant terms were retrieved. (5/7 equals 0.71.)

We can compute the average Recall rate over the whole test set and use it as a measure of the performance of a given EVM dictionary.  This gives us an approximate idea of how well each EVM dictionary does in matching the assigned terms for a given test set.
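The Recall rate and its average over a test set can be computed as in the sketch below. The function names and the placeholder term labels (a1, x1, and so on) are assumptions made for illustration; the labels are not the actual terms from the example document, only a list with the same counts (five of seven assigned terms in the top ten).

```python
def recall_at_k(suggested, assigned, k=10):
    """Fraction of the assigned terms found among the top k suggested terms."""
    assigned = set(assigned)
    if not assigned:
        raise ValueError("a test document needs at least one assigned term")
    hits = len(set(suggested[:k]) & assigned)
    return hits / len(assigned)

def average_recall(results, k=10):
    """Mean Recall over (suggested, assigned) pairs, one pair per test document."""
    rates = [recall_at_k(s, a, k) for s, a in results]
    return sum(rates) / len(rates)

# Placeholder labels: seven assigned terms, five of them in the top ten.
assigned = ["a1", "a2", "a3", "a4", "a5", "a6", "a7"]
suggested = ["x1", "a1", "x2", "x3", "a2", "x4", "x5", "a3", "a4", "a5"]
example_recall = recall_at_k(suggested, assigned, k=10)  # 5/7
```

Rounded to two places, example_recall is 0.71, matching the worked example above.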

There are a number of variables that could be taken into account to improve this basic measurement of performance. For example, the number of metadata terms originally assigned to each record, which varies according to the database or indexing system, should be considered.  In addition, since the test set does not overlap with the training set, it is possible that some metadata terms assigned to records in the test set were not used at all for the records in the training set; the dictionary cannot suggest such terms.  These are two examples of the factors to be considered for a more robust evaluation of EVM performance.
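One way to gauge the second factor is to count how many term assignments in the test set involve a term that never occurs in the training set. This is a diagnostic sketch only, not part of the evaluation method described in this report; the function name and the record layout (a "metadata" list per record) are assumptions.

```python
def unseen_term_fraction(training_records, test_records):
    """Share of test-set term assignments whose term never appears in training."""
    seen = {term for rec in training_records for term in rec["metadata"]}
    test_terms = [term for rec in test_records for term in rec["metadata"]]
    if not test_terms:
        return 0.0
    unseen = sum(1 for term in test_terms if term not in seen)
    return unseen / len(test_terms)

# Hypothetical records: "grain size" occurs only in the test set.
train_sample = [{"metadata": ["cosmic dust", "stellar evolution"]}]
test_sample = [{"metadata": ["cosmic dust", "grain size"]}]
fraction = unseen_term_fraction(train_sample, test_sample)
```

In this toy case half of the test assignments can never be matched, which places an upper bound on the achievable Recall.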

4.2. Overview of Precision and Recall measures

Another way to measure the performance of the EVM is to present the Recall and Precision rates at the different cutoffs as a graph to see the overall performance.

For example, the cutoff level of one means taking only the top-ranked term from the list of terms suggested by the EVM. If this term is one of five human-assigned metadata terms, the Precision is 1.00 and the Recall is .20.  By the same token, at the cutoff level of five, if three out of five human-assigned terms are retrieved, the Precision is .60 and the Recall is .60, too.

At the cutoff level of ten, if four out of five assigned terms are retrieved, the Precision rate will be .40 and the Recall rate will be .80.

By averaging over the whole test query set, we can get the overall Precision and Recall at each cutoff level. Also, by increasing the cutoff level step by step, we can see the overall performance of a given EVM dictionary for a given set of test documents.
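The cutoff-based measures can be sketched as follows. Precision is taken as hits divided by the cutoff level, which agrees with the figures in the examples above (for instance, 4/10 = .40 at cutoff ten). The function names and the placeholder labels (h for human-assigned terms, x for other terms) are assumptions.

```python
def precision_recall_at_k(suggested, assigned, k):
    """Return (Precision, Recall) for the top k suggested terms."""
    assigned = set(assigned)
    hits = len(set(suggested[:k]) & assigned)
    return hits / k, hits / len(assigned)

def average_curve(results, cutoffs):
    """Mean (Precision, Recall) per cutoff over (suggested, assigned) pairs."""
    curve = {}
    for k in cutoffs:
        pairs = [precision_recall_at_k(s, a, k) for s, a in results]
        curve[k] = (sum(p for p, _ in pairs) / len(pairs),
                    sum(r for _, r in pairs) / len(pairs))
    return curve

# Placeholder document: 1 hit in the top 1, 3 in the top 5, 4 in the top 10.
assigned = ["h1", "h2", "h3", "h4", "h5"]
suggested = ["h1", "x1", "h2", "x2", "h3", "x3", "h4", "x4", "x5", "x6"]
curve = average_curve([(suggested, assigned)], cutoffs=[1, 5, 10])
```

For this single document, curve holds the Precision and Recall pairs from the examples above: (1.00, .20) at cutoff one, (.60, .60) at cutoff five, and (.40, .80) at cutoff ten.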

5. Testing data set

We have prepared several data sets for testing the EVM approach. We used records from the INSPEC database for our testing.

We tried to collect data at different levels of subject domain coverage to test the effect of the specificity of the subject domains of the data set. We will describe the issue of subject domains in the EVM approach in another report ().

The general description of the data sets used in the evaluation follows.

INSPEC general domain

We tried to prepare one general EVM, which covers the whole range of subject domains of the INSPEC database, by downloading 10% of all the INSPEC records available on Melvyl.

We extracted every tenth record by accession number. This accession number seems to be assigned to each record incrementally when it is added to the database.  This way, we believe we obtained a reasonably representative sample of the whole INSPEC collection.  We collected about 150,000 records.
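Taking every tenth record by accession number can be sketched as below. The report does not say exactly how the selection was carried out, so ordering the records by accession number and taking every tenth position is an assumption, as are the function name and the "accession" field name.

```python
def every_nth(records, n=10, key="accession"):
    """Order records by accession number and keep every n-th one."""
    ordered = sorted(records, key=lambda rec: rec[key])
    return ordered[::n]

# Hypothetical records with incrementally assigned accession numbers.
all_records = [{"accession": number} for number in range(1, 1001)]
sample = every_nth(all_records, n=10)
```

Because accession numbers appear to be assigned incrementally as records are added, this kind of systematic sample is spread evenly over the time span of the collection.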

INSPEC subdomain data sets

For the subdomain data collection, we relied on the Science Citation Index’s Journal Citation Report. This report provides an extensive list of subject categories covering the science and engineering areas, and for each subject category it provides the titles of the most influential journals, ranked by their Journal Impact Factor.

We arbitrarily selected several subject categories from the Journal Citation Report and used their list of journal titles for collecting the data set for each subject domain.

a. INSPEC electrical engineering
b. INSPEC physics
c. INSPEC astronomy and astrophysics
d. INSPEC aerospace science

6.  Preliminary results of the performance evaluation

With the above listed data sets, we have obtained preliminary evaluation results. They are presented here according to the two types of measurements discussed in section 4. These measures can be analyzed in the light of the variables discussed in section 3.

Among the variables discussed in section 3, sensitivity to the subdomain approach of EVM dictionary is discussed in another technical report. 

In this report, we first provide evaluation results of general performance of the selected data sets.

For this test, we used both title and abstract in the training and testing steps.

The following table contains the average Recall measures of the EVM dictionaries, the first measure described in section 4. We used the Recall rate at the cutoff level of 10 for this measurement.

General performance

INSPEC general
Electrical Engineering
Aerospace Science

We present the evaluation results of the same data sets in a different way in the following chart, which shows the Precision and Recall rates at each cutoff level from one to twenty.