The GENIA Ontology
Ontologies have been developed in the biomedical sciences for several applications.
Such ontologies include conceptual hierarchies for databases covering diseases
and drug names. Several groups interested in interconnecting databases under
a uniform view are attempting to construct more general ontologies (e.g., the
Gene Ontology and the BioCon Knowledge Base of the TAMBIS Project).
The GENIA ontology is intended to be a formal model of cell signaling
reactions in humans. It is to be used as a basis for thesauri and semantic
dictionaries for natural language processing applications, e.g.,
- Information retrieval (IR) & filtering (IF)
- Information extraction (IE)
- Document and term classification & categorization
- Summarization, etc.
Another use of the GENIA ontology is to provide the basis for an integrated
view of multiple databases, including CSNDB, developed at the National
Institute of Health Sciences.
The current version of the GENIA ontology, a taxonomy of some entities
involved in reactions (Figure 1), was developed
as the semantic classification used in the GENIA
corpus. The links in the figure point to notes on each class, which serve as
scope notes for annotation. The notes also include examples
of entities that belong to the class.
Notes on Classes of The GENIA Ontology
Substance
The entities under the "substance" node refer to the substances involved
in biochemical reactions. In this taxonomy, the substances are classified
according to their chemical characteristics rather than their biological
role. This is because, in the annotation work, the classes should be as
mutually exclusive as possible and stably defined for the ease of the task.
The chemical classification of a substance is quite independent of the
biological context in which the substance appears; it is therefore more
stably defined and can be easily expanded into other ontologies.
An amino acid molecule or the compounds that consist of amino acids.
Proteins include protein groups, families, molecules, complexes, and substructures.
A family or a group of proteins, e.g., STATs
A protein complex, e.g., RNA polymerase II. The class includes conjugated
proteins such as lipoproteins and glycoproteins.
An individual member of a group of non-complex proteins, e.g., STAT1, STAT2,
STAT3, or a (non-complex) protein not regarded as a member of a particular
group.
A monomer in a complex, e.g., RNA polymerase II alpha subunit.
A secondary structure or a combination of secondary structures, e.g., leucine-zipper,
zinc-finger, alpha-helix, beta-sheet, helix-loop-helix
A tertiary structure that is supposed to have a particular function, e.g.,
SH2, SH3.
A peptide e.g., peptide hormone, 15 amino acids, 18-20 residue-long peptide
fragment
An amino acid monomer, e.g., tyrosine, serine, tyr, ser
A nucleic acid molecule or the compounds that consist of nucleic acids.
DNAs include DNA groups, families, molecules, domains, and regions.
A family or a group of DNAs, e.g., myc family genes, rel family genes
An individual member of a family or a group of DNAs, e.g., AP-1/c-jun expression
vector, AP2 cDNA
A substructure of a DNA molecule that is supposed to have a particular function,
such as a gene, e.g., c-jun gene, promoter region, Sp1 site, CA repeat.
This class also includes a base sequence that has a particular function.
RNAs include RNA groups, families, molecules, domains, and regions.
A family or a group of RNAs, e.g., tRNAs, viral RNA, HIV mRNA
An individual molecule of RNA, e.g., globin mRNA, Oct-T1 transcript
A domain or a region of RNA, e.g., polyA site, alternative splicing site
Polynucleotides include primers and synthetic DNA fragments.
An individual nucleotide, e.g., guanine, thymidine, uridine, ATP, GTP
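The substance notes above can be read as a small tree. The following is a minimal sketch of that reading, not an authoritative copy of Figure 1: the class identifiers (for example `protein_family_or_group`) are hypothetical labels chosen for illustration, and the nesting is inferred from the notes. It covers only the classes described above.

```python
# Hypothetical identifiers; the nesting follows the class notes above and is
# an assumption, not a transcription of Figure 1.
SUBSTANCE_TAXONOMY = {
    "substance": {
        "amino_acid": {
            "protein": {
                "protein_family_or_group": {},
                "protein_complex": {},
                "protein_molecule": {},
                "protein_subunit": {},
                "protein_substructure": {},
                "protein_domain_or_region": {},
            },
            "peptide": {},
            "amino_acid_monomer": {},
        },
        "nucleic_acid": {
            "DNA": {
                "DNA_family_or_group": {},
                "DNA_molecule": {},
                "DNA_domain_or_region": {},
            },
            "RNA": {
                "RNA_family_or_group": {},
                "RNA_molecule": {},
                "RNA_domain_or_region": {},
            },
            "polynucleotide": {},
            "nucleotide": {},
        },
    }
}


def path_to(label, tree=SUBSTANCE_TAXONOMY, prefix=()):
    """Return the list of classes from the root down to `label`, or None."""
    for name, children in tree.items():
        here = prefix + (name,)
        if name == label:
            return list(here)
        found = path_to(label, children, here)
        if found is not None:
            return found
    return None


def leaf_classes(tree=SUBSTANCE_TAXONOMY):
    """Collect the leaf labels, i.e. the most specific classes in the tree."""
    leaves = []
    for name, children in tree.items():
        leaves.extend(leaf_classes(children) if children else [name])
    return leaves
```

Because every label occurs exactly once in the tree, each term receives a single path from the root, which mirrors the requirement that the classes be as mutually exclusive as possible.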
Source
Sources are biological locations where substances are found and their reactions
take place, such as human (an organism), liver (a tissue), leukocyte (a
cell), membrane (a sub-location of a cell) or HeLa (a cultured cell line).
Organisms are further classified into multi-cell organisms, mono-cell organisms
other than viruses, and viruses. Within a multi-cell organism, tissues, cells,
and sub-locations are interrelated by the `part-of' relation, but that relation
is not shown in Figure 1.
Organisms include multi-cell organisms, mono-cell organisms, and viruses.
A multi-cell organism, e.g., human, mouse
A mono-cell organism other than viruses, e.g., E. coli, yeast
A virus, e.g., HIV, HTLV, EBV
A body part, e.g., central nervous system, immune system, blood
A tissue, e.g., peripheral blood, lymphoid tissue, vascular endothelium
A cell type, e.g., T-lymphocyte, T cell, astrocyte, fibroblast
A part of cells that has a particular function, e.g., nucleus, cytoplasm
Cultured, immortalized or otherwise artificially processed sources.
The class includes cell strains and established cell cultures, e.g., HeLa
cell, NIH 3T3, lymphoma line, human bone marrow culture
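The `part-of' relation among sources is separate from the class hierarchy itself. As a rough sketch, it could be kept as its own mapping alongside the taxonomy. The labels below are hypothetical, and the direction of the chain (sub-location within cell, cell within tissue, tissue within multi-cell organism) is an assumption based on the order in which the text lists them.

```python
# Class-level part-of links for sources of a multi-cell organism.
# Labels and chain order are assumptions for illustration only.
PART_OF = {
    "sub_location": "cell",
    "cell": "tissue",
    "tissue": "multi_cell_organism",
}


def containers(label, part_of=PART_OF):
    """Follow part-of links upward and return every enclosing class in order."""
    result = []
    seen = {label}
    while label in part_of:
        label = part_of[label]
        if label in seen:  # guard against an accidental cycle in the mapping
            break
        seen.add(label)
        result.append(label)
    return result
```

For example, `containers("sub_location")` walks from a cell sub-location up to the organism, while a class with no part-of entry, such as a virus, yields an empty list.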
In the GENIA corpus, terms that are not categorized as sources or substances
may be marked up with <subClassOf resource="GENIA#other_names"/>. These
terms represent entities that play important roles in biological reactions
but are not yet fully classified in the GENIA ontology. We will collect these
terms and classify them to further enhance the ontology.
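Collecting the unclassified terms could look roughly like the sketch below. Only the `<subClassOf resource="GENIA#other_names"/>` element comes from the text above. The enclosing `<annotations>` and `<term>` elements, the `GENIA#protein_molecule` value, and the sample terms are made up for illustration and do not reflect the actual corpus format.

```python
import xml.etree.ElementTree as ET

# Invented sample; only the subClassOf element with GENIA#other_names is
# taken from the document. Everything else here is a hypothetical wrapper.
SAMPLE = """
<annotations>
  <term><subClassOf resource="GENIA#protein_molecule"/>STAT1</term>
  <term><subClassOf resource="GENIA#other_names"/>apoptosis</term>
  <term><subClassOf resource="GENIA#other_names"/>signal transduction</term>
</annotations>
"""


def unclassified_terms(xml_text, resource="GENIA#other_names"):
    """Return the text of every <term> whose subClassOf points at `resource`."""
    root = ET.fromstring(xml_text.strip())
    terms = []
    for term in root.iter("term"):
        tag = term.find("subClassOf")
        if tag is not None and tag.get("resource") == resource:
            # The term text follows the empty subClassOf element, so it is
            # gathered with itertext() rather than read from term.text.
            terms.append("".join(term.itertext()).strip())
    return terms
```

A list gathered this way could then be reviewed by hand when deciding which new classes to add.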