
We have expanded the GENIA corpus to 2000 abstracts. The base set was taken from the results of a MEDLINE query with the MeSH terms Human, Blood Cells, and Transcription Factors, and is a superset of the base set of Ver 1.1. The semantic classes of technical terms are marked up, as in GENIA corpus version 1.1, but in this version terms nested inside other terms are also marked up. We have also corrected sentence boundary errors.
This version is not annotated with part-of-speech (POS) information. The POS-tagged corpus is released as Ver 3.0p.
The markup language GPML has also been revised. The major revision is that we discarded the <localresource> element and the <term> element. Semantic class information is now annotated directly in the abstract, using <cons> elements (GPML quick reference).
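As an illustration of how nested <cons> annotations might be read, here is a minimal Python sketch. It is not part of the distributed package. The sample fragment, the <sentence> wrapper, the attribute name `sem`, and the class labels are assumptions made for this example; consult the GPML quick reference for the actual element and attribute names.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the <sentence> wrapper, the `sem` attribute name,
# and the class labels are assumptions, not taken from the corpus itself.
SAMPLE = (
    '<sentence>'
    '<cons sem="G#other_name">'
    '<cons sem="G#DNA_domain_or_region">IL-2 gene</cons> expression'
    '</cons>'
    ' requires activation.'
    '</sentence>'
)


def list_terms(xml_text):
    """Return (nesting depth, semantic class, surface text) for every <cons>."""
    root = ET.fromstring(xml_text)
    terms = []

    def walk(node, depth):
        for child in node:
            if child.tag == "cons":
                # itertext() joins the text of the term and of any nested terms.
                terms.append((depth, child.get("sem"), "".join(child.itertext())))
                walk(child, depth + 1)
            else:
                walk(child, depth)

    walk(root, 0)
    return terms


TERMS = list_terms(SAMPLE)
for depth, sem, text in TERMS:
    # Indent nested terms under the term that contains them.
    print("  " * depth + f"{text} [{sem}]")
```

Because the annotation sits inline in the abstract text, an outer term's surface string is recovered by concatenating its own text with that of the terms nested inside it.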
A sample set of 3 abstracts can be viewed using the latest versions of Netscape, Mozilla, or Opera (what the source XML file looks like). For users of other browsers, here is what the sample set looks like with the style sheet included in the package.
The full corpus is available from the download page.
The pages were last updated on 17 March 2003 by Tateisi Yuka.
Department of Information Science, Faculty of Science, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan.