GENIAcorpus3.02p
The same set of texts as GENIA corpus ver 3.x (2000 abstracts from the MEDLINE database) is tagged for parts of speech. The tag set is essentially the Penn Treebank (PTB) POS tag set, with the following major differences.
- The NNP and NNPS (proper name) tags are not used, except for the names of journals, authors, and research institutes, and the initials of patients. In particular, names (such as discoverers' names) in technical terms (e.g. Epstein-Barr virus, Southern blotting) are not tagged with NNP tags.
- We tried to eliminate SYM tags as much as possible.
The abstracts are first tagged by the JunK tagger and then corrected by human annotators.
The corpus is available in two formats.
- PTB-like format: The file contains one token/POS pair per line, and a "==========" line is put between sentences.
- ``Merged'' gpml format: The POS information is merged into GENIA corpus ver 1.1 using the <w> tag, which surrounds each token; the POS is represented as the value of the `c' attribute.
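The PTB-like format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the corpus distribution. It assumes that each line has the form token/POS with the POS after the last slash (so that tokens which themselves contain a slash are kept whole), and that the sentence separator is exactly the "==========" line. The function name read_sentences is hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

# Separator line between sentences, as described for the PTB-like format.
SENTENCE_BREAK = "=========="


def read_sentences(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield each sentence as a list of (token, POS) pairs.

    Assumption: a line looks like "token/POS", and the POS follows the
    LAST slash, so a token such as "and/or" is not split.
    """
    sentence: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line == SENTENCE_BREAK:
            if sentence:
                yield sentence
                sentence = []
            continue
        token, sep, pos = line.rpartition("/")
        if not sep:
            raise ValueError(f"no token/POS separator in line: {line!r}")
        sentence.append((token, pos))
    if sentence:
        # Last sentence when the input does not end with a separator line.
        yield sentence
```

To process a file, pass an open file object, e.g. `for sentence in read_sentences(open(path, encoding="utf-8")): ...`, where `path` is wherever the downloaded file was saved.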
In the merged format, but not in the PTB-like format, some tokens are assigned "*" as their POS. This occurs when a token is split by <term> tags assigned by the annotators of the original GENIA corpus. In such cases, the last fragment of the split token is assigned the original POS tag given by the POS annotators, and the other fragments are assigned "*", e.g. <w c='*'>anti-</w><term sem="#003"><w c='JJ'>IgM</w></term>.
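The "*" convention means that the original tokens can be recovered by gluing each "*" fragment onto the fragment that follows it. The sketch below illustrates this with Python's standard xml.etree.ElementTree module. It assumes the input is a well-formed piece of the merged markup; the wrapping <s> element and the function name rejoin_tokens are hypothetical and are used only to make the fragment parseable.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def rejoin_tokens(xml_fragment: str) -> List[Tuple[str, str]]:
    """Rebuild (token, POS) pairs from merged-format markup.

    Fragments tagged c='*' are concatenated with the following fragments
    until one carrying a real POS tag is reached.
    """
    # Hypothetical wrapper element so the fragment has a single root.
    root = ET.fromstring(f"<s>{xml_fragment}</s>")
    tokens: List[Tuple[str, str]] = []
    pending = ""  # text of '*' fragments seen so far for the current token
    for w in root.iter("w"):  # <w> elements in document order
        text = pending + (w.text or "")
        pos = w.get("c")
        if pos == "*":
            pending = text
        else:
            tokens.append((text, pos))
            pending = ""
    if pending:
        # A trailing '*' fragment with no closing fragment; keep it as-is.
        tokens.append((pending, "*"))
    return tokens


example = """<w c='*'>anti-</w><term sem="#003"><w c='JJ'>IgM</w></term>"""
print(rejoin_tokens(example))
```

For the example taken from the text, the two fragments are rejoined into the single token anti-IgM with POS JJ.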
The corpus is available from the download page.
References
- Readme (Aug 2003)
- Tateisi, Yuka and Jun'ichi Tsujii. (2004). Part-of-Speech Annotation of Biology Research Abstracts. In the Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC2004). IV. pp. 1267-1270.
The pages were last updated on the 16th August 2004 by Tateisi Yuka.
Department of Information Science, Faculty of Science,
University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan.