A part-of-speech tagger for English



Developed at:
University of Tokyo, Department of Computer Science,
Tsujii laboratory

Version 1.0

Overview

Tagging speed is crucial in large-scale information extraction and real-time NLP applications. This part-of-speech (POS) tagger offers fast tagging (2400 tokens/sec) with state-of-the-art accuracy (97.10% on the WSJ corpus). The tagger uses an extension of Maximum Entropy Markov Models (MEMMs), in which tags are determined in an easiest-first manner. For details of the algorithm and performance, see [1].

How to use the tagger

The tagger has been tested only on Linux with gcc.

1. Download the latest version of the tagger

2. Expand the archive

> tar xvzf postagger.tar.gz

3. Make

> cd postagger/
> make

4. Tag sentences

Prepare a text file containing one sentence per line, then run:

> ./tagger < TEXTFILE > TAGGEDTEXT
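When the sentences are held in memory rather than in a file, the same invocation can be driven from a script. The following is a minimal Python sketch, assuming the ./tagger binary built in step 3 is in the current directory and that it writes one tagged line per input line, as in the command above. The helper names to_tagger_input and run_tagger are hypothetical and not part of the distribution.

```python
import subprocess


def to_tagger_input(sentences):
    """Join sentences into the one-sentence-per-line form the tagger reads."""
    lines = []
    for sentence in sentences:
        # Collapse internal whitespace (including newlines) so that
        # each sentence occupies exactly one line; skip empty entries.
        flat = " ".join(sentence.split())
        if flat:
            lines.append(flat)
    return "".join(line + "\n" for line in lines)


def run_tagger(sentences, tagger_cmd=("./tagger",)):
    """Pipe sentences to the tagger's stdin and return its stdout lines.

    Assumption: the binary path and the one-line-in, one-line-out
    behaviour follow the usage shown in this document.
    """
    result = subprocess.run(
        list(tagger_cmd),
        input=to_tagger_input(sentences),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tagger exits non-zero
    )
    return result.stdout.splitlines()
```

Starting the process once for a whole batch, rather than once per sentence, avoids paying the startup cost repeatedly.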

Example

> echo "He opened the window." | ./tagger
He/PRP opened/VBD the/DT window/NN ./.
>
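
The output in the example consists of whitespace-separated token/TAG pairs. A minimal Python sketch for reading such a line back into pairs follows. It assumes that a tag never contains a slash, so splitting on the last "/" stays correct for tokens that themselves contain one; parse_tagged_line is a hypothetical helper name.

```python
def parse_tagged_line(line):
    """Split one line of tagger output into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the LAST slash: the token may contain "/" itself,
        # while the tag is assumed not to.
        token, sep, tag = item.rpartition("/")
        if not sep:
            raise ValueError("no '/' separator in item: %r" % item)
        pairs.append((token, tag))
    return pairs


print(parse_tagged_line("He/PRP opened/VBD the/DT window/NN ./."))
# [('He', 'PRP'), ('opened', 'VBD'), ('the', 'DT'), ('window', 'NN'), ('.', '.')]
```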

References

[1] Yoshimasa Tsuruoka and Jun'ichi Tsujii, Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data, Proceedings of HLT/EMNLP 2005, pp. 467-474.


This page is maintained by Yoshimasa Tsuruoka