Part-of-speech (POS) tagging is an early step in natural language processing, often performed immediately after or together with tokenization. After tokenization, every token is assigned a POS label. The GENIA POS annotation generally follows the Penn Treebank POS tagging scheme, with the following modifications:
  • The NNP and NNPS (proper name) tags are used only for the names of journals, authors, and research institutes, and for the initials of patients. In particular, names (including discoverers' names) that form part of technical terms (e.g. Epstein-Barr virus, Southern blotting) are not tagged as NNP.
  • We tried to eliminate SYM tags as much as possible.

Corpus format

The corpus is available in two formats, both included in the package available for download below.
  • PTB-like format: The file contains one token/POS pair per line, and sentences are separated by a "==========" line (ten equals signs).
  • "Merged" gpml format: The POS information is merged into GENIA corpus ver 3.02 using a <w> tag that surrounds each token, with the POS given as the value of the "c" attribute.
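As an illustration, the PTB-like format described above could be read with a short loop such as the following. This is a minimal sketch, not part of the GENIA distribution: the function name read_ptb_like is hypothetical, and the code assumes that the token and its tag are separated by the last "/" on each line, so that tokens which themselves contain a slash stay intact.

```python
from typing import Iterable, Iterator, List, Tuple

# Sentence separator: a line of ten equals signs, as described above.
SENTENCE_SEPARATOR = "=" * 10


def read_ptb_like(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield one sentence at a time as a list of (token, POS) pairs."""
    sentence: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line == SENTENCE_SEPARATOR:
            if sentence:
                yield sentence
                sentence = []
            continue
        # Assumption: the tag follows the LAST slash on the line.
        token, sep, tag = line.rpartition("/")
        if not sep:
            raise ValueError(f"no token/POS separator in line: {line!r}")
        sentence.append((token, tag))
    if sentence:
        yield sentence  # final sentence without a trailing separator
```

Any iterable of lines can be passed in, for example an open file object, so sentences are produced lazily without loading the whole corpus into memory.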
In the merged format, but not in the PTB-like format, some tokens are assigned "*" as their POS. This occurs when a token is split by <term> tags assigned by the annotators of the original GENIA corpus. In such cases, the last fragment of the split token carries the POS tag assigned by the POS annotators, and the other fragments are assigned "*", e.g. <w c="*">anti-</w><term sem="#003"><w c='JJ'>IgM</w></term>.
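The fragment rule above can be undone when reading the merged format: fragments tagged "*" are joined to the following fragments until one with a real POS tag is reached. The sketch below is an illustration under stated assumptions, not the official reader. The function name merged_tokens and the wrapping <root> element are hypothetical and exist only so that a bare fragment parses as XML; a real gpml file has its own root element and would be loaded with ET.parse instead.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def merged_tokens(xml_fragment: str) -> List[Tuple[str, str]]:
    """Return (token, POS) pairs, rejoining tokens split by <term> tags."""
    # Hypothetical wrapper element so the fragment is well-formed XML.
    root = ET.fromstring(f"<root>{xml_fragment}</root>")
    tokens: List[Tuple[str, str]] = []
    pending = ""  # text of "*" fragments waiting for their final fragment
    for w in root.iter("w"):  # <w> elements in document order
        text = "".join(w.itertext())
        pos = w.get("c")
        if pos == "*":
            pending += text
        else:
            tokens.append((pending + text, pos))
            pending = ""
    if pending:
        # Unexpected trailing fragment: keep it rather than drop it silently.
        tokens.append((pending, "*"))
    return tokens


example = '<w c="*">anti-</w><term sem="#003"><w c=\'JJ\'>IgM</w></term>'
print(merged_tokens(example))  # [('anti-IgM', 'JJ')]
```

The <term> elements are ignored here because only the POS layer is being recovered; the same traversal could be extended to record the enclosing term if the term annotation is needed as well.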


Annotation guidelines

  • Tateisi, Yuka and Jun'ichi Tsujii. GENIA Annotation Guidelines for Tokenization and POS tagging. Technical Report (TR-NLP-UT-2006-4). Tsujii Laboratory, University of Tokyo, 2006.




