I'm using the Brown Corpus. I want some way to print out all the possible tags and their names (not just the tag abbreviations). There are also quite a few tags. Is there a way to 'simplify' them? By simplify I mean combining two extremely similar tags into one and re-tagging the merged words with the other tag.
This has been discussed before. The POS tags output by nltk.pos_tag come from the Penn Treebank tagset, https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html; see also "What are all possible pos tags of NLTK?"
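There are several approaches, but the simplest might be to use only the first 2 characters of each POS tag as the main set of POS tags. This works because the first two characters of a tag represent the broad POS class in the Penn Treebank tagset. The Brown Corpus ships with its own Brown tagset rather than the Penn Treebank one, but its tags follow a similar prefix pattern, as the output further down shows.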
For instance, NNS means plural noun and NNP means proper noun, while the NN tag subsumes both by representing the generic noun.
Here's a code example:
>>> from nltk.corpus import brown
>>> from collections import defaultdict
>>> x = defaultdict(list)
>>> for word,pos in brown.tagged_words()[1:100]:
...     x[pos].append(word)
...
>>> x
defaultdict(<type 'list'>, {u'DTI': [u'any'], u'BEN': [u'been'], u'VBD': [u'said', u'produced', u'took', u'said'], u'NP$': [u"Atlanta's"], u'NN-TL': [u'County', u'Jury', u'City', u'Committee', u'City', u'Court', u'Judge', u'Mayor-nominate'], u'VBN': [u'conducted', u'charged', u'won'], u"''": [u"''", u"''", u"''"], u'WDT': [u'which', u'which', u'which'], u'JJ': [u'recent', u'over-all', u'possible', u'hard-fought'], u'VBZ': [u'deserves'], u'NN': [u'investigation', u'primary', u'election', u'evidence', u'place', u'jury', u'term-end', u'charge', u'election', u'praise', u'manner', u'election', u'term', u'jury', u'primary'], u',': [u',', u','], u'.': [u'.', u'.'], u'TO': [u'to'], u'NP': [u'September-October', u'Durwood', u'Pye', u'Ivan'], u'BEDZ': [u'was', u'was'], u'NR': [u'Friday'], u'NNS': [u'irregularities', u'presentments', u'thanks', u'reports', u'irregularities'], u'``': [u'``', u'``', u'``'], u'CC': [u'and'], u'RBR': [u'further'], u'AT': [u'an', u'no', u'The', u'the', u'the', u'the', u'the', u'the', u'the', u'The', u'the'], u'IN': [u'of', u'in', u'of', u'of', u'for', u'in', u'by', u'of', u'in', u'by'], u'CS': [u'that', u'that'], u'NP-TL': [u'Fulton', u'Atlanta', u'Fulton'], u'HVD': [u'had', u'had'], u'IN-TL': [u'of'], u'VB': [u'investigate'], u'JJ-TL': [u'Grand', u'Executive', u'Superior']})
>>> len(x)
29
The shortened version looks like this:
>>> x = defaultdict(list)
>>> for word,pos in brown.tagged_words()[1:100]:
...     x[pos[:2]].append(word)
...
>>> x
defaultdict(<type 'list'>, {u'BE': [u'was', u'been', u'was'], u'VB': [u'said', u'produced', u'took', u'said', u'deserves', u'conducted', u'charged', u'investigate', u'won'], u'WD': [u'which', u'which', u'which'], u'RB': [u'further'], u'NN': [u'County', u'Jury', u'investigation', u'primary', u'election', u'evidence', u'irregularities', u'place', u'jury', u'term-end', u'presentments', u'City', u'Committee', u'charge', u'election', u'praise', u'thanks', u'City', u'manner', u'election', u'term', u'jury', u'Court', u'Judge', u'reports', u'irregularities', u'primary', u'Mayor-nominate'], u'TO': [u'to'], u'CC': [u'and'], u'HV': [u'had', u'had'], u'``': [u'``', u'``', u'``'], u',': [u',', u','], u'.': [u'.', u'.'], u"''": [u"''", u"''", u"''"], u'CS': [u'that', u'that'], u'AT': [u'an', u'no', u'The', u'the', u'the', u'the', u'the', u'the', u'the', u'The', u'the'], u'JJ': [u'Grand', u'recent', u'Executive', u'over-all', u'Superior', u'possible', u'hard-fought'], u'IN': [u'of', u'in', u'of', u'of', u'of', u'for', u'in', u'by', u'of', u'in', u'by'], u'NP': [u'Fulton', u"Atlanta's", u'Atlanta', u'September-October', u'Fulton', u'Durwood', u'Pye', u'Ivan'], u'NR': [u'Friday'], u'DT': [u'any']})
>>> len(x)
19
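When truncating to two characters is too coarse, an explicit merge table gives finer control over which tags get combined. The following is a minimal sketch in plain Python, not something NLTK provides: the `MERGE` table and the `retag` helper are hypothetical names, and the tag pairs in the table are illustrative choices only. The sample pairs are copied from the Brown output above so the block runs without downloading the corpus.

```python
from collections import Counter

# Hypothetical merge table: each key is a fine-grained tag and each value is
# the tag it should be folded into. These pairs are illustrative only.
MERGE = {
    "NNS": "NN",
    "NN-TL": "NN",
    "VBD": "VB",
    "VBZ": "VB",
    "VBN": "VB",
}


def retag(tagged_words, merge=MERGE):
    """Return (word, tag) pairs with each tag replaced per the merge table.

    Tags that are not listed in the table are kept unchanged.
    """
    return [(word, merge.get(tag, tag)) for word, tag in tagged_words]


# A few (word, tag) pairs taken from the Brown output above.
sample = [
    ("investigation", "NN"),
    ("irregularities", "NNS"),
    ("County", "NN-TL"),
    ("said", "VBD"),
    ("deserves", "VBZ"),
    ("Friday", "NR"),
]

merged = retag(sample)
tag_counts = Counter(tag for _, tag in merged)

print(merged)
print(tag_counts)
```

In practice you would pass `brown.tagged_words()` in place of `sample` and extend the table with whichever tags you consider similar enough to merge.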
Another solution is to use the universal POS tags; see http://www.nltk.org/book/ch05.html
>>> x = defaultdict(list)
>>> for word,pos in brown.tagged_words(tagset='universal')[1:100]:
...     x[pos].append(word)
...
>>> x
defaultdict(<type 'list'>, {u'ADV': [u'further'], u'NOUN': [u'Fulton', u'County', u'Jury', u'Friday', u'investigation', u"Atlanta's", u'primary', u'election', u'evidence', u'irregularities', u'place', u'jury', u'term-end', u'presentments', u'City', u'Committee', u'charge', u'election', u'praise', u'thanks', u'City', u'Atlanta', u'manner', u'election', u'September-October', u'term', u'jury', u'Fulton', u'Court', u'Judge', u'Durwood', u'Pye', u'reports', u'irregularities', u'primary', u'Mayor-nominate', u'Ivan'], u'ADP': [u'of', u'that', u'in', u'that', u'of', u'of', u'of', u'for', u'in', u'by', u'of', u'in', u'by'], u'DET': [u'an', u'no', u'any', u'The', u'the', u'which', u'the', u'the', u'the', u'the', u'which', u'the', u'The', u'the', u'which'], u'.': [u'``', u"''", u'.', u',', u',', u'``', u"''", u'.', u'``', u"''"], u'PRT': [u'to'], u'VERB': [u'said', u'produced', u'took', u'said', u'had', u'deserves', u'was', u'conducted', u'had', u'been', u'charged', u'investigate', u'was', u'won'], u'CONJ': [u'and'], u'ADJ': [u'Grand', u'recent', u'Executive', u'over-all', u'Superior', u'possible', u'hard-fought']})
>>> len(x)
9
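To make it clearer what `tagset='universal'` does conceptually, here is a rough sketch of a lookup from Brown tags to universal tags. The `BROWN_TO_UNIVERSAL` table and the `group_by_universal` helper are hypothetical names, and the table is only a small subset read off the two outputs above (for example, 'was' is tagged BEDZ in the first output and appears under VERB in the last one). It is not NLTK's actual mapping file.

```python
from collections import defaultdict

# Partial, illustrative Brown-to-universal table inferred from the outputs
# above. NLTK's real mapping covers the whole Brown tagset.
BROWN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NP": "NOUN", "NR": "NOUN",
    "VBD": "VERB", "BEDZ": "VERB", "HVD": "VERB",
    "AT": "DET", "DTI": "DET", "WDT": "DET",
    "IN": "ADP", "CS": "ADP",
}


def group_by_universal(tagged_words, table=BROWN_TO_UNIVERSAL, unknown="X"):
    """Group words by coarse tag; tags missing from the table go to `unknown`."""
    groups = defaultdict(list)
    for word, tag in tagged_words:
        groups[table.get(tag, unknown)].append(word)
    return dict(groups)


# (word, Brown tag) pairs taken from the first output above.
sample = [
    ("Friday", "NR"),
    ("was", "BEDZ"),
    ("had", "HVD"),
    ("that", "CS"),
    ("any", "DTI"),
    ("which", "WDT"),
    ("of", "IN"),
]

groups = group_by_universal(sample)
print(groups)
print(len(groups))
```

Seven words with seven distinct Brown tags collapse into four coarse groups here, which is the same effect the `tagset='universal'` argument has on the full corpus.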