NTCIR GeoTime 2010-11

Collections

For NTCIR-8, two collections were used for testing:

Japanese: The Mainichi 2002-2005 news collection of the ACLIA track.

The Mainichi collection has the following document distribution by month:

Mainichi documents by month

Year     Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec   Total
2002    7874   7753   8462   8070   8302   8157   9131   8619   8283   8919   8548   8171  100289
2003    7942   7499   8397   8282   8518   8131   9288   8526   8370   8971   8492   8359  100775
2004    7995   7945   8283   7536   7518   7412   8316   7662   6991   7342   7417   6907   91324
2005    6717   6452   7384   7423   7285   7114   7800   7113   7240   7266   7022   6737   85553
Total  30528  29649  32526  31311  31623  30814  34535  31920  30884  32498  31479  30174  377941

English: The New York Times 2002-2005 news collection from the Linguistic Data Consortium (LDC), which requires a US$50 shipping and handling fee.

The New York Times collection has the following document distribution by month:

New York Times collection by month

Year     Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec   Total
2002    9775  10623  11487  11109  10840  10238   4908   9714  10208  11027   9767  10092  119788
2003    5940   1901   2093   2209   1980   1925   2045   1678   2135   2144   1881   1896   27827
2004    1850   1630   1834   1674    372      0   9277   9927   9493  10113   9249   7943   63362
2005    9220   7552   8736   7824   8118   8965   9827   9896   6898   9583   8914   8907  104440
Total  26785  21706  24150  22816  21310  21128  26057  31215  28734  32867  29811  28838  315417

As the table shows, the document set is much smaller from January 2003 through June 2004, with no documents at all in June 2004. This is a known flaw in the collection, relayed to us by LDC. The source document images used for OCR were too corrupt to produce reliable text, so these documents were omitted. As a result, creating topics for stories covered during that period will differ from creating topics for the Mainichi collection over the same period.
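When choosing topic dates, it may help to check programmatically which months of the New York Times collection are thinly covered. The following is a minimal Python sketch, not part of the official GeoTime tooling: the counts are copied from the table above (with June 2004 entered as 0), while the function name `sparse_months` and the cutoff of half the median monthly count are assumptions chosen only for illustration.

```python
from statistics import median

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Monthly document counts for the New York Times collection,
# copied from the table above (June 2004 has no documents).
NYT_COUNTS = {
    2002: [9775, 10623, 11487, 11109, 10840, 10238, 4908, 9714, 10208, 11027, 9767, 10092],
    2003: [5940, 1901, 2093, 2209, 1980, 1925, 2045, 1678, 2135, 2144, 1881, 1896],
    2004: [1850, 1630, 1834, 1674, 372, 0, 9277, 9927, 9493, 10113, 9249, 7943],
    2005: [9220, 7552, 8736, 7824, 8118, 8965, 9827, 9896, 6898, 9583, 8914, 8907],
}


def sparse_months(counts, threshold=0.5):
    """Return (year, month, count) for months below threshold * median count.

    The threshold is an illustrative assumption, not an official criterion.
    """
    all_counts = [n for row in counts.values() for n in row]
    cutoff = threshold * median(all_counts)
    return [
        (year, MONTHS[i], n)
        for year, row in sorted(counts.items())
        for i, n in enumerate(row)
        if n < cutoff
    ]


if __name__ == "__main__":
    for year, month, n in sparse_months(NYT_COUNTS):
        print(f"{month} {year}: {n} documents")
```

With the default threshold, this flags the months from February 2003 through June 2004. January 2003 and July 2002 are also lower than neighboring months but stay above this particular cutoff, so a stricter or looser threshold may suit other uses.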