
GENIAcorpus2.1

This release contains the same set of texts as GENIA corpus ver. 1.1 (670 abstracts from the MEDLINE database), tagged for parts of speech. The tag set is essentially the Penn Treebank (PTB) POS tag set, with the following major differences.

The corpus is available in two formats: a merged format and a PTB-like format.

In the merged format, but not in the PTB-like format, some tokens are assigned "*" as their POS. This occurs when a token is split by the <term> tags assigned by the annotators of the original GENIA corpus. In such cases, the last fragment of the split token carries the original POS tag assigned by the POS annotators, and the other fragments are assigned "*", e.g. <w c='*'>anti-</w><term sem="#003"><w c='JJ'>IgM</w></term>.
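As an illustration of the "*" convention, the following is a minimal Python sketch that rejoins split tokens and recovers their original POS tag. It assumes only what the example above shows: tokens are <w> elements with the POS in the c attribute, and a "*" fragment is always followed by the fragment that holds the real tag. The function name merge_split_tokens and the wrapping <root> element are hypothetical, introduced only so that the fragment parses as well-formed XML; the actual file layout of the merged format may differ.

```python
import xml.etree.ElementTree as ET


def merge_split_tokens(xml_fragment):
    """Rejoin tokens split by <term> tags and return (token, POS) pairs.

    Assumption: fragments tagged "*" precede the fragment that carries
    the original POS tag, as in the example from the corpus description.
    """
    # Hypothetical wrapper element so the fragment has a single root.
    root = ET.fromstring("<root>" + xml_fragment + "</root>")

    tokens = []
    pending = ""  # text of "*" fragments waiting for their final fragment
    for w in root.iter("w"):  # document order, regardless of <term> nesting
        text = w.text or ""
        pos = w.get("c")
        if pos == "*":
            pending += text
        else:
            tokens.append((pending + text, pos))
            pending = ""

    # A trailing "*" fragment with no tagged fragment after it is kept as-is.
    if pending:
        tokens.append((pending, "*"))
    return tokens


sample = """<w c='*'>anti-</w><term sem="#003"><w c='JJ'>IgM</w></term>"""
print(merge_split_tokens(sample))  # [('anti-IgM', 'JJ')]
```

The <term> boundaries are deliberately ignored here, since the goal is only to restore the POS-level tokenization; a reader who needs the term annotation as well would have to track the enclosing <term> elements separately.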

The corpus is available from the download page.

Readme (Aug 2003)

These pages were last updated on 16 August 2004 by Tateisi Yuka.

Department of Information Science, Faculty of Science, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan.