
Corpus annotation is now a key topic for all areas of natural language processing (NLP) and information extraction (IE) which employ supervised learning. With the explosion of results in molecular-biology there is an increased need for IE to extract knowledge to support database building and to search intelligently for information in online journal collections. To support this we are building a corpus of annotated abstracts taken from National Library of Medicine's MEDLINE database. In GENIA Corpus we annotate a subset of the substances and the biological locations involved in reactions of proteins, based on a data model (GENIA ontology) of the biological domain, in XML format (GPML).
GENIA Corpus Version 3.0x consists of 2000 abstracts. The base abstracts are selected from the search results with keywords (MeSH terms) Human, Blood Cells, and Transcription Factors.The corpus and the GPML DTD are available from our download page.
Older releases, Version 1.0 (470 abstracts) and Version 1.1 (670 abstracts which includes the 470 of Version 1.0) are also available.
The pages were last updated on the 17th August 2004 by Yuka Tateisi.
Department of Information Science, Faculty of Science, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan.