OpenNLP Part-of-Speech Tags

August 27, 2013 by Paula Petcu


A Part-of-Speech tagger, a crucial component of a natural language processing system, annotates words with tags corresponding to the part of speech they represent. However, annotating with part-of-speech tags is often not the last step of processing a text. The tagger is language-dependent and each language has its own part-of-speech tagset, which makes the annotated tags harder to use when dealing with multiple languages. This post is a reference for the POS tagsets used for several languages, including English, Danish, Swedish, and German, and for how they can be mapped to each other.

The Part-of-Speech (POS) tagger is one of the most common components of a natural language processing (NLP) system. Apache OpenNLP, a toolkit for processing natural language text, includes this component, which annotates words with tags corresponding to the part of speech they represent. On the OpenNLP website, pre-trained models for Danish, German, English, Dutch, Swedish, and Portuguese are available for download for version 1.5 of the toolkit. OpenNLP can also train new models from a training data set consisting of tag-annotated text collections (the OpenNLP documentation describes how). The pre-trained models were trained on already annotated text collections called treebanks. For the English POS tagger model from Apache OpenNLP, the training texts were annotated with the tags of the Penn English Treebank.
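To make "tag-annotated text" concrete, here is a minimal Python sketch that writes tagged sentences as one sentence per line, with each token joined to its tag by an underscore. This layout is an assumption based on the token_TAG form the tagger itself prints, and the helper name `to_training_line` is hypothetical. Check the OpenNLP documentation for the exact format the trainer expects.

```python
def to_training_line(tagged_sentence):
    """Join (token, tag) pairs into a single 'token_TAG token_TAG ...' line."""
    return " ".join(f"{token}_{tag}" for token, tag in tagged_sentence)


# Two hand-annotated sentences using Penn Treebank tags.
sentences = [
    [("Peter", "NNP"), ("bought", "VBD"), ("a", "DT"), ("car", "NN"), (".", ".")],
    [("Mary", "NNP"), ("bought", "VBD"), ("a", "DT"), ("new", "JJ"), ("car", "NN"), (".", ".")],
]

# One sentence per line; this text could be saved to a file for training.
training_text = "\n".join(to_training_line(s) for s in sentences)
print(training_text)
```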



Here is an example of POS tagging using Apache OpenNLP on an English text using the pre-trained model for English:

Peter and Mary bought a new car today from a car shop in San Francisco .
Peter_NNP and_CC Mary_NNP bought_VBD a_DT new_JJ car_NN today_NN from_IN a_DT car_NN shop_NN in_IN San_NNP Francisco_NNP ._.

And here is the same sentence in Danish, tagged using the pre-trained POS model for Danish:

Peter og Mary købte en ny bil i dag fra en bil butik i San Francisco .
Peter_NP og_CC Mary_NP købte_VA en_PI ny_AN bil_NC i_SP dag_NC fra_SP en_PI bil_NC butik_NC i_SP San_NP Francisco_NP ._XP

Notice the difference between the tag names?
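
To work with this output programmatically, the token_TAG items first have to be split back into tokens and tags. The Python sketch below does that for the two example outputs and prints the tags of the tokens that appear in both sentences. The function name `parse_tagged` is hypothetical and not part of OpenNLP.

```python
def parse_tagged(line):
    """Split 'token_TAG' items into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last underscore so tokens containing '_' stay intact.
        token, _, tag = item.rpartition("_")
        pairs.append((token, tag))
    return pairs


english = ("Peter_NNP and_CC Mary_NNP bought_VBD a_DT new_JJ car_NN today_NN "
           "from_IN a_DT car_NN shop_NN in_IN San_NNP Francisco_NNP ._.")
danish = ("Peter_NP og_CC Mary_NP købte_VA en_PI ny_AN bil_NC i_SP dag_NC "
          "fra_SP en_PI bil_NC butik_NC i_SP San_NP Francisco_NP ._XP")

# Token -> tag lookups (a repeated token keeps the tag of its last occurrence).
en_tags = dict(parse_tagged(english))
da_tags = dict(parse_tagged(danish))

# Tokens that occur in both sentences, in English sentence order.
shared = [token for token in en_tags if token in da_tags]
for token in shared:
    print(f"{token}: English={en_tags[token]} Danish={da_tags[token]}")
```

The names and the final full stop are the only tokens the two sentences share, and each of them gets a different tag in the two languages.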


Each language has its own POS tagset

The tagger is language-dependent and each language has its own POS tagset. This means, for example, that the same person name is annotated with a different tag in Danish texts than in English ones, and that some languages don’t make use of certain categories of words (for example articles). However, annotating with POS tags is often not the last step of processing a text: the tags are frequently used further for visualisations, in research evaluations, or in identifying named entities and their relationships in text. Research in the area of multilingual POS tagging also shows the need to map the tagsets between languages (for example in this article).

Table 1. POS tagset for English (source: Penn Treebank Project)

Number Tag Description
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb

Identifying which tagsets were used for the other languages is less straightforward. However, based on some code from another Apache project, it seems that the Danish model follows the PAROLE tagset, the German model the STTS tagset, the Portuguese model the PALAVRAS tagset, the Dutch model the WOTAN tagset, and the Swedish model a tagset based on the lexical categories in MAMBA.

Table 2. Supposed POS tagset for Danish (created in the PAROLE-DK project by the Danish Society for Language and Literature, according to the Copenhagen Dependency Treebanks project).

Number Tag Description
1. NP proper nouns
2. NC common nouns
3. XF foreign words
4. VA main verbs
5. VE medial verbs
6. AN 'normal' adjectives
7. AC cardinal numerals
8. AO ordinal numerals
9. RG adverbs
10. SP prepositions and postpositions
11. CC coordinating conjunctions
12. CS subordinating conjunctions
13. PD demonstrative pronouns
14. PI indefinite pronouns
15. PT interrogative/relative pronouns
16. PP personal pronouns
17. PO possessive pronouns
18. PC reciprocal pronouns
19. I interjections
20. U unique
21. XA abbreviations
22. XP punctuation marks
23. XR various sequences of numbers and letters
24. XS symbols
25. XX text errors (spelling errors, typos, inflectional errors, conjugational errors, etc.)

Table 3. Supposed POS tagset for German (STTS, the Stuttgart-Tübingen tagset)

Number Tag Description
1. ADJA attributive adjective
2. ADJD adverbial or predicative adjective
3. ADV Adverb
4. APPR Preposition
5. APPRART Preposition with article folded in
6. APPO Postposition
7. APZR Right part of circumposition
8. ART definite or indefinite article
9. CARD cardinal number
10. FM foreign word
11. ITJ interjection
12. KOUI subordinating conjunction with 'zu' and infinitive
13. KOUS subordinating conjunction with sentence
14. KON coordinating conjunction
15. KOKOM comparative conjunction
16. NN common noun
17. NE proper noun
18. PDS substituting demonstrative pronoun
19. PDAT attributive demonstrative pronoun
20. PIS substituting indefinite pronoun
21. PIAT attributive indefinite pronoun
22. PIDAT attributive indefinite pronoun with a determiner
23. PPER non-reflexive personal pronoun
24. PPOSS substituting possessive pronoun
25. PPOSAT attribute adding possessive pronoun
26. PRELS substituting relative pronoun
27. PRELAT attribute adding relative pronoun
28. PRF reflexive personal pronoun
29. PWS substituting interrogative pronoun
30. PWAT attribute adding interrogative pronoun
31. PWAV adverbial interrogative or relative pronoun
32. PAV pronominal adverb
33. PTKZU 'zu' before infinitive
34. PTKNEG Negation particle
35. PTKVZ particle part of separable verb
36. PTKANT answer particle
37. PTKA particle associated with adverb or adjective
38. TRUNC first member of compound noun
39. VVFIN full finite verb
40. VVIMP full imperative
41. VVINF full infinitive
42. VVIZU full infinitive with "zu"
43. VVPP full past participle
44. VAFIN auxiliary finite verb
45. VAIMP auxiliary imperative
46. VAINF auxiliary infinitive
47. VAPP auxiliary past participle
48. VMFIN modal finite verb
49. VMINF modal infinitive
50. VMPP modal past participle
51. XY Non word with special characters
52. $, comma
53. $. sentence ending punctuation
54. $( other sentence internal punctuation

Table 4. Supposed POS tagset for Swedish (based on the lexical categories in MAMBA)

Number Tag Description
1. AB Adverb
2. DT Determiner
3. HA WH-adverb
4. HD WH-determiner
5. HP WH-pronoun
6. HS WH-possessive
7. IE Infinitival marker
8. IN Interjection
9. JJ Adjective
10. KN Coordinating conjunction
11. NN Noun
12. PC Participle
13. PL Particle
14. PM Proper Noun
15. PN Pronoun
16. PP Preposition
17. PS Possessive pronoun
18. RG Cardinal number
19. RO Ordinal number
20. SN Subordinating conjunction
21. VB Verb
22. UO Foreign word
23. MAD Major delimiter
24. MID Minor delimiter
25. PAD Pairwise delimiter

“Universal” POS tags

A standardisation proposal was made in A Universal Part-of-Speech Tagset, which maps common tags between 22 different languages (Arabic, Basque, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, French, German, Greek, Hungarian, Italian, Japanese, Korean, Portuguese, Russian, Slovenian, Spanish, Swedish, Turkish). It defines 12 “universal” POS tags: NOUN (nouns), VERB (verbs), ADJ (adjectives), ADV (adverbs), PRON (pronouns), DET (determiners and articles), ADP (prepositions and postpositions), NUM (numerals), CONJ (conjunctions), PRT (particles), ‘.’ (punctuation marks) and X (other categories). The mapping and tagsets are available at https://code.google.com/p/universal-pos-tags/.
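
As a rough illustration of how such a mapping is applied, the Python sketch below converts a few English and Danish tags to the 12 universal tags. The two dictionaries are small hand-written subsets derived from the tag descriptions in Tables 1 and 2. They are assumptions made for this example, not the project's official mapping files, and the function name `to_universal` is hypothetical.

```python
# Illustrative subsets only (assumptions based on the tag descriptions,
# not copied from the official universal-pos-tags mapping files).
EN_TO_UNIVERSAL = {
    "NNP": "NOUN", "NN": "NOUN", "CC": "CONJ", "VBD": "VERB",
    "DT": "DET", "JJ": "ADJ", "IN": "ADP", ".": ".",
}
DA_TO_UNIVERSAL = {
    "NP": "NOUN", "NC": "NOUN", "CC": "CONJ", "VA": "VERB",
    "PI": "PRON", "AN": "ADJ", "SP": "ADP", "XP": ".",
}


def to_universal(pairs, mapping):
    """Replace each language-specific tag with its universal tag.

    Tags missing from the mapping fall back to 'X' (other categories).
    """
    return [(token, mapping.get(tag, "X")) for token, tag in pairs]


english = [("Peter", "NNP"), ("bought", "VBD"), ("a", "DT"), ("car", "NN"), (".", ".")]
danish = [("Peter", "NP"), ("købte", "VA"), ("en", "PI"), ("bil", "NC"), (".", "XP")]

en_universal = to_universal(english, EN_TO_UNIVERSAL)
da_universal = to_universal(danish, DA_TO_UNIVERSAL)

print([tag for _, tag in en_universal])
print([tag for _, tag in da_universal])
```

After the mapping, the names, verbs, nouns, and punctuation share the same tags in both languages. The article still differs (DET for English "a", PRON for Danish "en"), because the Danish tagger labelled it as an indefinite pronoun, which shows that a mapping cannot fully hide differences between the underlying tagsets.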




REFERENCES AND FURTHER READING:

1. Definition of Part-of-speech tagging on Wikipedia
2. Documentation for the OpenNLP Part-of-Speech Tagger
3. A Universal Part-of-Speech Tagset, by Slav Petrov, Dipanjan Das and Ryan McDonald
4. Pre-trained models in OpenNLP (Danish, German, English, Spanish, Dutch, Swedish, Portuguese)
5. The Penn Treebank Project
6. Alphabetical list of part-of-speech tags used in the Penn Treebank Project
7. WebCorp Linguist's Search Engine
8. Definition of Treebank on Wikipedia


