
GENIAcorpus3.02p

The same set of texts as GENIA corpus ver. 3.x (2000 abstracts from the MEDLINE database) is tagged for parts of speech. The tag set is basically the Penn Treebank (PTB) POS tag set, with the following major differences.

The abstracts are first tagged by the JunK tagger and then corrected by human annotators.

The corpus is available in two formats: a merged format and a PTB-like format.

In the merged format, but not in the PTB-like format, some tokens are assigned "*" as their POS. This occurs when a token is split by <term> tags assigned by the annotators of the original GENIA corpus. In such cases, the last fragment of the split token receives the POS tag assigned by the POS annotators, and the other fragments receive "*", e.g. <w c='*'>anti-</w><term sem="#003"><w c='JJ'>IgM</w></term>.
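As an illustration of the "*" convention, the following is a minimal sketch of how split tokens in the merged format might be rejoined. It assumes only what the example above shows: <w> elements with a c attribute, optionally nested inside <term> elements. The wrapper element <s> and the function name merge_split_tokens are hypothetical and are not part of the corpus or its tools.

```python
import xml.etree.ElementTree as ET


def merge_split_tokens(xml_fragment):
    """Return (token, pos) pairs, rejoining fragments tagged '*'.

    Assumption: each run of '*' fragments is followed by a final
    fragment that carries the real POS tag, as in the example above.
    """
    # Wrap the fragment in a hypothetical root so it parses as XML.
    root = ET.fromstring("<s>" + xml_fragment + "</s>")
    tokens = []
    pending = ""  # text of '*' fragments waiting for their last fragment
    # iter() walks <w> elements in document order, ignoring <term> nesting.
    for w in root.iter("w"):
        text = w.text or ""
        if w.get("c") == "*":
            pending += text
        else:
            tokens.append((pending + text, w.get("c")))
            pending = ""
    # Any trailing '*' fragment without a final fragment is left out.
    return tokens


example = "<w c='*'>anti-</w><term sem=\"#003\"><w c='JJ'>IgM</w></term>"
print(merge_split_tokens(example))  # [('anti-IgM', 'JJ')]
```

The sketch concatenates fragments without inserting whitespace, since the fragments are parts of one original token.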

The corpus is available from the download page.

References

  1. Readme (Aug 2003)
  2. Tateisi, Yuka and Jun'ichi Tsujii. (2004). Part-of-Speech Annotation of Biology Research Abstracts. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004). IV. pp. 1267-1270.

The pages were last updated on the 16th August 2004 by Tateisi Yuka.

Department of Information Science, Faculty of Science, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan.