Evaluation Methods: EVM Dictionaries
Evaluation of the performance of the EVM dictionaries
Youngin Kim, June 30, 2000
The basic function of the EVM module is to provide a mapping from the natural terms of the user’s query to the metadata terms used in a given database, helping the user select the proper metadata terms when searching databases.
One way of evaluating the performance of the EVM is to measure its predictive power in retrieving the “relevant” metadata terms. Fortunately, we have a basis for identifying these “relevant” terms: each document of the database that we build an EVM for already has metadata terms assigned, and these are what we use for training, that is, for extracting the associations that make up the EVM dictionaries.
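As an illustration of this training step, the following sketch counts co-occurrences between the words of each training record’s title and abstract and its assigned metadata terms. The record field names (“title”, “abstract”, “metadata_terms”) are assumptions for the sketch, and raw counts stand in for the actual EVM weighting scheme.

```python
from collections import defaultdict, Counter

def build_evm_dictionary(training_records):
    """Associate natural-language words from title/abstract with the
    metadata terms assigned to the same record (raw co-occurrence counts;
    illustrative only, not the actual EVM weighting)."""
    associations = defaultdict(Counter)
    for record in training_records:
        text = (record.get("title", "") + " " + record.get("abstract", "")).lower()
        for word in set(text.split()):
            for meta_term in record["metadata_terms"]:
                associations[word][meta_term] += 1
    return associations
```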
We also take advantage of the existing metadata terms in testing the performance of our technique, by comparing the performance of the EVM with that of the human indexers on the same task. The question to be answered is, "How good is the EVM dictionary at predicting the metadata terms already assigned by human indexers?"
By comparing the metadata terms suggested by our EVM dictionary with the actual terms assigned by the human indexers for the same documents, we can measure one aspect of the effectiveness of EVM dictionaries. In this report we explain the evaluation method that we have adopted and show initial evaluation results based on this method.
2. Methods of Evaluation
2.1. Division of the data set
The above evaluation idea can be carried out by using documents as test queries against our EVM dictionaries. We accomplish this by dividing the data set into two groups, training and testing. When we collect the data for building an EVM dictionary, we set aside a portion (usually 10 to 30 percent, depending on the size of the whole collection) of the dataset to be used for the evaluation. We call this the ‘test’ set, and the set used for dictionary building the ‘training’ set. The test documents are therefore a subset of the records from the original dataset, but they are not used in the process of dictionary building.
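A minimal sketch of this split, assuming the records are held in a simple list, might look like the following.

```python
import random

def split_records(records, test_fraction=0.2, seed=0):
    """Set aside a portion of the collection (here 20%, within the
    10-30 percent range mentioned above) as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]   # (training set, test set)
```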
2.2. Training and testing
We build the EVM dictionary as usual only with
the training set. Once the dictionary is prepared, we use the documents
in the test set as a “query” set, pretending that the natural terms in
the titles and abstracts of the test set as query terms for the EVM search.
We send these natural terms in each test document for EVM dictionary search
to retrieve the "most likely" metadata terms suggested by the
EVM dictionary. This set of term will be compared to the metadata
terms assigned to the same records by the human indexers.
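A sketch of this testing step, continuing the illustrative association dictionary above, might look like the following; the real EVM search uses its own ranking weights.

```python
from collections import Counter

def suggest_metadata_terms(associations, document, top_k=10):
    """Treat the words of a test document's title and abstract as a query
    and rank metadata terms by their summed association counts."""
    text = (document.get("title", "") + " " + document.get("abstract", "")).lower()
    scores = Counter()
    for word in set(text.split()):
        scores.update(associations.get(word, {}))
    return [term for term, _ in scores.most_common(top_k)]
```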
3. Modes of evaluation
The above basic method of evaluation can be applied to test many different variables involved in EVM dictionary building. We classify these into the following modes of evaluation.
3.1. General effectiveness of a single EVM dictionary
By comparing the performance of an EVM dictionary against the human indexing, we can test the basic predictive power of each EVM dictionary; this tests the general effectiveness of the EVM approach.
3.2. Effectiveness of different linguistic approaches
a. Comparison of the
word-based dictionary with one that is phrase-based.
So far, we have developed two ways of extracting
meaningful terms from the natural text in the training process. One way is to use single words as terms to
be associated with metadata terms, and the other is to apply a Natural
Language Processing technique to extract noun phrases from the natural
text and use those noun phrases as terms to be associated with metadata
terms. By creating two different EVM dictionaries
based on these two methods with the same data set, we can comparatively
evaluate the effectiveness of different linguistic approaches.
b. Comparison of the results from more than one NLP technique
Also, in applying NLP techniques to extract noun phrases, we can use more than one method separately on the same training data, build multiple EVM dictionaries, and then test them. This will provide one measure of comparative evaluation of different NLP methods and help us decide which NLP method is more suitable for EVM dictionary building. A brief sketch of the word-based versus phrase-based extraction follows.
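As an illustration of the two extraction approaches, the following sketch uses NLTK’s shallow chunker as a stand-in for whichever NLP technique is actually used in the EVM system; the grammar shown is only one possible noun-phrase pattern.

```python
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data packages

# A simple noun-phrase pattern: optional determiner, adjectives, then nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

def word_terms(text):
    """Word-based extraction: each alphabetic token is a candidate term."""
    return [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]

def phrase_terms(text):
    """Phrase-based extraction: keep shallow-parsed noun phrases."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunker.parse(tagged)
    return [" ".join(word for word, _ in subtree.leaves()).lower()
            for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]
```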
3.3. Sensitivity of the subdomain EVM
Another important variable that we can test is the sensitivity to subject domain specificity. The idea is that for databases with very broad subject coverage, such as INSPEC, a subdomain approach, in which EVM dictionaries are created for subject domains that are more specific and smaller than the whole subject range the database covers, would perform better in mapping the natural terms to the metadata terms. This can also be tested with the same method of evaluation, by comparing the performance of EVM dictionaries with different levels of subject coverage.
3.4. Use of title and/or abstract
One important issue in the EVM dictionary building process is the question of which data elements should be used for the optimal performance of the resulting dictionary. There are two fields of the training data from which we can extract natural language terms: title and abstract. We could use both in the dictionary building process, or only one of them. To make a more reasonable decision on this matter, we could build different EVM dictionaries under these modes and examine their relative performance.
We could extend this issue to another aspect of EVM dictionary evaluation. That is, in the testing process, we could use both the title and the abstract of the test documents, which are presented as queries to the EVM dictionary search, or only one of them. We could compare the results of these two modes and analyze them to make the simulation test setting more reasonable, and to better understand the workings of the EVM’s vocabulary mapping process.
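One simple way to make this choice explicit, sketched below with assumed field names, is to parameterize which fields contribute text in both the training and the testing steps.

```python
def record_text(record, use_title=True, use_abstract=True):
    """Assemble the natural-language text of a record, depending on
    which fields are switched on for training or for querying."""
    parts = []
    if use_title:
        parts.append(record.get("title", ""))
    if use_abstract:
        parts.append(record.get("abstract", ""))
    return " ".join(parts)
```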
4. Measurements of the performance
We briefly explained how we want to test the performance
of EVM dictionaries in the above section. The basic idea is to examine
the degree of overlap between the metadata terms suggested by the EVM
dictionary and the terms assigned by human indexers.
The initial result of this examination can be presented as follows. For each query (one test document in our test setting), we get the following result from our EVM search.
Metadata terms assigned by the human indexer to this specific document:

Number of occurrences | Human Assigned Term     | Probability
13,937                | pre-main-sequence stars | 1.000000
21,902                | cosmic dust             | 1.000000
29,317                | circumstellar matter    | 1.000000
66,758                | stellar evolution       | 1.000000
99,045                | galaxy                  | 1.000000
114,492               | interstellar matter     | 1.000000
116,266               | revue                   | 1.000000
The top ten terms from the ranked list of metadata terms suggested by the EVM dictionary search.
Results: Recall rate = 0.71 (5/7 terms matched)

Matches a human-assigned term? (1=yes, 0=no) | Weight       | Thesaurus Term
1                                            | 31025.181641 | galaxy
0                                            | 13584.625000 | software agents
1                                            | 12618.86523  | interstellar matter
0                                            | 12148.734375 | object-oriented programming
0                                            | 11733.599609 | stellar spectra
1                                            | 11633.339844 | cosmic dust
0                                            | 11312.328125 | astronomical spectra
0                                            | 11217.458008 | grain size
1                                            | 9971.115234  | stellar evolution
1                                            | 9889.843750  | circumstellar matter
We can tell, for this specific document, that five out of seven human-assigned terms were actually included in the top ten of the EVM dictionary search results.
There are many possibilities for converting this kind of result into measures that represent the overall performance of the application. We have developed two ways of measuring the performance based on this result.
4.1. Average Recall measure
First, we measured the hit rate, which we may call the Recall rate in the sense that it counts the number of retrieved relevant terms among the assigned (and therefore relevant) terms. In the above example, the Recall rate would be 0.71, since five out of seven relevant terms were retrieved (5/7 is approximately 0.71).
We can compute the average Recall rate over the whole test set and use it as a measure of performance of a given EVM dictionary. This will give us an approximate idea of how well each EVM dictionary does on a given test set in matching the assigned terms.
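A sketch of this measure, assuming the suggested terms and the human-assigned terms for each test document are available as lists, follows.

```python
def recall_at_k(suggested, assigned, k=10):
    """Fraction of the human-assigned terms found among the top-k suggestions."""
    hits = sum(1 for term in assigned if term in suggested[:k])
    return hits / len(assigned) if assigned else 0.0

def average_recall(results, k=10):
    """Mean Recall over the test set; results is a list of
    (suggested_terms, assigned_terms) pairs, one per test document."""
    recalls = [recall_at_k(s, a, k) for s, a in results]
    return sum(recalls) / len(recalls) if recalls else 0.0
```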
There are a number of variables that could be taken into account to improve this basic measurement of the performance. For example, the number of original metadata terms, which varies according to the database or indexing system, should be considered. In addition, since the test set does not overlap with the training set, it is possible that some metadata terms assigned to records in the test set were not used at all for the records in the training set. These are two examples of the factors to be considered for a more robust evaluation of EVM performance.
4.2. Overview of Precision and Recall measures
Another way to measure the performance of the EVM is to present the Recall and Precision rates at different cutoff levels as a graph, to see the overall performance.
For example, at the cutoff level of one, which means taking only the top-ranked term from the list of terms suggested by the EVM, if this term is one of the five human-indexed metadata terms, the Precision is 1.00 and the Recall is .20. By the same token, at the cutoff level of five, if three out of five human-assigned terms are retrieved, the Precision is .60 and the Recall rate is .60, too.
At the cutoff level of ten, if four out of five
assigned terms are retrieved, the Precision rate will be .40 and the Recall
rate will be .80.
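The same computation, extended to Precision and Recall at an arbitrary cutoff, can be sketched as follows.

```python
def precision_recall_at_k(suggested, assigned, k):
    """Precision and Recall for one test document at cutoff k."""
    top = suggested[:k]
    hits = sum(1 for term in top if term in assigned)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(assigned) if assigned else 0.0
    return precision, recall

# Example from the text: five assigned terms, four of them among the
# top ten suggestions, gives Precision 0.40 and Recall 0.80.
```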
By averaging the performance over the whole test query set, we can get the overall Precision and Recall at the cutoff level of one. By increasing the cutoff level, we can then see the overall performance of a given EVM dictionary for a given set of test documents.
5. Testing data set
We have prepared several data sets for testing the EVM approach. We used records from the INSPEC database for our testing.
We tried to collect data at different levels of subject domain coverage to test the effect of the specificity of the subject domains of the data set. We will describe the issue of subject domains in the EVM approach in another report.
The general description
of the data sets used in the evaluation follows.
INSPEC general domain
We tried to prepare one general EVM, covering the whole range of subject domains of the INSPEC database, by downloading 10% of all the INSPEC records available on Melvyl. We extracted every tenth record by accession number. The accession number appears to be assigned to each record incrementally as it is added to the database, so we believe this gave us a fairly representative sample of the whole INSPEC collection. We collected about 150,000 records.
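A sketch of this sampling step, assuming each record carries an integer accession number field (an assumed name here), follows; taking every tenth record could equally be done by sorting on accession number and keeping every tenth entry.

```python
def every_tenth_record(records):
    """Systematic 10% sample: keep records whose accession number is
    divisible by ten (assumes an integer 'accession_number' field)."""
    return [r for r in records if int(r["accession_number"]) % 10 == 0]
```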
INSPEC subdomain data sets
For the subdomain data collection, we relied on the Science Citation Index’s Journal Citation Report. This report provides an extensive list of subject categories covering the science and engineering areas, and for each subject category it provides the titles of the most influential journals, ranked by their Journal Impact Factor measurements.
We arbitrarily selected several subject categories from the Journal Citation Report and used their lists of journal titles to collect the data set for each subject domain.
a. INSPEC electrical
engineering
b. INSPEC physics
c. INSPEC astronomy and
astrophysics
d. INSPEC aerospace science
6. Preliminary results of the performance evaluation
With the above listed data sets, we have obtained preliminary evaluation
results. They are presented here according to the two types of measurements
discussed in section 4. These measures can be analyzed in the light of
the variables discussed in section 3.
Among the variables discussed in section 3, the sensitivity to the subdomain approach of EVM dictionaries is discussed in another technical report.
In this report, we first provide evaluation results
of general performance of the selected data sets.
For this test, we used both title and abstract
in the training and testing steps.
The following table contains the average Recall measures of the EVM dictionaries, the first measure described in section 4. We used the Recall rate at a cutoff level of 10 for this measurement.
Data set               | General performance
INSPEC general         | 0.323139
Electrical Engineering | 0.429096
Aerospace Science      | 0.437366
Physics                | 0.455524
Astrophysics           | 0.660648
We present the
evaluation results of the same data sets in a different way in the following
chart, which shows the Precision and Recall rates at each cutoff
level from one to twenty.
[Chart: Precision and Recall rates at cutoff levels one through twenty for each data set.]