GENIAcorpus3.0p
This is the POS-annotated version of the GENIA corpus Ver 3.0 (2000 abstracts).
As in version 2.1, the tag set is basically the Penn Treebank (PTB) POS tag set, with the following major differences.
- The NNP and NNPS (proper name) tags are not used, except for the names
of journals, authors, research institutes, and initials of patients. In particular,
(discoverers') names in technical terms (e.g. Epstein-Barr virus, Southern
blotting) are not tagged with NNP tags.
- We tried to eliminate SYM tags as much as possible.
The corpus is available in three formats.
- PTB-like format: The file contains one token/POS pair per line,
and a "====================" line is put between sentences.
- xml format: Each token is enclosed in <w> tags, and its POS is
given as the value of the `c' attribute.
- ``Merged'' gpml format: The POS information is merged into
GENIA corpus ver 3.0. A token may be split by <cons> tags assigned by
the annotators of the original GENIA corpus. In such cases, the last
fragment of a split token is assigned the original POS tag assigned by
the POS annotators, and the other fragments are assigned "*",
e.g. <cons lex="ER-mediated_repression" sem="G#other_name"><cons
lex="ER" sem="G#protein_family_or_group"><w c="*">ER</w></cons><w
c="JJ">-mediated</w> <w c="NN">repression</w></cons>.
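The formats above can be read with a few lines of Python. The following is
only a minimal sketch, not part of the corpus distribution. It assumes that
in the PTB-like format the token and its POS are separated by the last "/"
on the line (tokens themselves may contain "/"), and that a sentence
separator line consists solely of "=" characters. The function names
read_ptb_like and merge_fragments are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_ptb_like(lines):
    """Yield sentences as lists of (token, pos) pairs from PTB-like lines.

    Assumption: token and POS are separated by the LAST "/" on the line,
    and a line made up only of "=" characters ends a sentence.
    """
    sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if set(line) == {"="}:
            # Sentence separator: emit the sentence collected so far.
            if sentence:
                yield sentence
                sentence = []
            continue
        # rpartition splits on the last "/", so "and/or/CC" -> ("and/or", "CC").
        token, _, pos = line.rpartition("/")
        sentence.append((token, pos))
    if sentence:
        yield sentence  # last sentence without a trailing separator


def merge_fragments(element):
    """Return (token, pos) pairs for all <w> elements under `element`.

    Fragments tagged "*" are joined to the following fragments until a
    fragment with a real POS tag is reached; that tag is used for the
    reassembled token.
    """
    pairs = []
    pending = ""  # text of "*" fragments waiting for their final fragment
    for w in element.iter("w"):
        text = w.text or ""
        if w.get("c") == "*":
            pending += text
        else:
            pairs.append((pending + text, w.get("c")))
            pending = ""
    return pairs
```

For the example fragment above, with the outer <cons> element closed so
that it parses as well-formed XML, merge_fragments(ET.fromstring(fragment))
gives [("ER-mediated", "JJ"), ("repression", "NN")]. The same function also
works on the plain xml format, where no token is split and no "*" tags
occur.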
The xml files have been checked with Internet Explorer 6.0 and Mozilla 1.1.
The corpus is available from the download page.