Collections
For NTCIR-8, two collections were used for testing:
Japanese: The Mainichi 2002-2005 news collection of the ACLIA Track.
The Mainichi collection has the following document distribution by month:
Mainichi documents by month

| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2002 | 7874 | 7753 | 8462 | 8070 | 8302 | 8157 | 9131 | 8619 | 8283 | 8919 | 8548 | 8171 | 100289 |
| 2003 | 7942 | 7499 | 8397 | 8282 | 8518 | 8131 | 9288 | 8526 | 8370 | 8971 | 8492 | 8359 | 100775 |
| 2004 | 7995 | 7945 | 8283 | 7536 | 7518 | 7412 | 8316 | 7662 | 6991 | 7342 | 7417 | 6907 | 91324 |
| 2005 | 6717 | 6452 | 7384 | 7423 | 7285 | 7114 | 7800 | 7113 | 7240 | 7266 | 7022 | 6737 | 85553 |
| Total | 30528 | 29649 | 32526 | 31311 | 31623 | 30814 | 34535 | 31920 | 30884 | 32498 | 31479 | 30174 | 377941 |

English: The New York Times 2002-2005 news collection from the Linguistic Data Consortium (LDC), which requires a US$50 shipping and handling fee.
The New York Times collection has the following document distribution by month:
New York Times collection by month

| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2002 | 9775 | 10623 | 11487 | 11109 | 10840 | 10238 | 4908 | 9714 | 10208 | 11027 | 9767 | 10092 | 119788 |
| 2003 | 5940 | 1901 | 2093 | 2209 | 1980 | 1925 | 2045 | 1678 | 2135 | 2144 | 1881 | 1896 | 27827 |
| 2004 | 1850 | 1630 | 1834 | 1674 | 372 | 0 | 9277 | 9927 | 9493 | 10113 | 9249 | 7943 | 63362 |
| 2005 | 9220 | 7552 | 8736 | 7824 | 8118 | 8965 | 9827 | 9896 | 6898 | 9583 | 8914 | 8907 | 104440 |
| Total | 26785 | 21706 | 24150 | 22816 | 21310 | 21128 | 26057 | 31215 | 28734 | 32867 | 29811 | 28838 | 315417 |

As you can see from the table, the document set is much smaller from January 2003 through June 2004 (with no documents at all in June 2004). This is a known flaw in the collection, relayed to us by LDC: the source document images used for OCR were too corrupt to produce reliable text, so these documents were omitted. Creating topics for stories covered during that period will therefore differ from creating topics for the Mainichi collection over the same period.
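
For topic creators who want to check programmatically which months are thin, the following is a minimal sketch, not an official NTCIR tool. It copies the monthly New York Times counts from the table above, records June 2004 as 0 because the text states there are no documents for that month, and flags every month whose count falls below a fraction of the median monthly count. The function name `sparse_months` and the default `ratio=0.5` are assumptions chosen for illustration.

```python
from statistics import median

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Monthly New York Times document counts, copied from the table.
# June 2004 is entered as 0 (the text states there are no documents).
NYT = {
    2002: [9775, 10623, 11487, 11109, 10840, 10238,
           4908, 9714, 10208, 11027, 9767, 10092],
    2003: [5940, 1901, 2093, 2209, 1980, 1925,
           2045, 1678, 2135, 2144, 1881, 1896],
    2004: [1850, 1630, 1834, 1674, 372, 0,
           9277, 9927, 9493, 10113, 9249, 7943],
    2005: [9220, 7552, 8736, 7824, 8118, 8965,
           9827, 9896, 6898, 9583, 8914, 8907],
}


def sparse_months(counts, ratio=0.5):
    """Return (year, month) pairs whose count is below ratio * median.

    The median is taken over all monthly counts in `counts`.
    `ratio` is an illustrative cutoff, not a value from the source.
    """
    all_counts = [c for row in counts.values() for c in row]
    cutoff = ratio * median(all_counts)
    return [
        (year, MONTHS[i])
        for year, row in sorted(counts.items())
        for i, c in enumerate(row)
        if c < cutoff
    ]


if __name__ == "__main__":
    for year, month in sparse_months(NYT):
        print(year, month)
```

With the assumed cutoff of half the median, the flagged months run from February 2003 through June 2004. January 2003 (5940 documents) and July 2002 (4908 documents) are also well below typical months but sit above this particular cutoff, so a stricter or looser `ratio` may be appropriate depending on how much coverage a topic needs.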
