Tags: python-3.x, pyspark, nlp, n-gram, part-of-speech

Tags in Google Ngrams dataset


tl;dr: I can't find a comprehensive list of all tags used in the Google Ngrams dataset besides the one in the documentation, which only includes PoS tags plus _START_, _ROOT_ and _END_.

What do tokens like ,_., ._. and _._ mean? Given their frequencies (see below), I strongly suspect they are tags rather than proper tokens.


Context:
I am trying to extract information from Google's n-grams dataset and am having trouble understanding some of its tags and how to take them into account.

Ultimately, I would like to approximate how likely a word will follow another one.
For example, estimating how likely the token protection is to follow equal roughly means computing count("equal protection") / count("equal *"), where * is the wildcard: any 1-gram in the corpus.
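
To make the approximation concrete, here is a minimal Python sketch; the count helper is hypothetical and stands in for whatever corpus lookup you use:

    def follow_probability(first, second, count):
        # P(second | first) ~= count("first second") / count("first *")
        # `count` is a hypothetical helper returning corpus frequencies.
        return count(f"{first} {second}") / count(f"{first} *")

    # e.g. follow_probability("equal", "protection", count)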

The tricky part is computing that count("equal *").
Indeed, the bigram equal to is counted several times in the Google n-grams dataset:

  • as equal to,
  • as equal to_PRT (the PoS-disambiguated version),
  • as equal _PRT_ (aggregated over all particles, PRT, that may follow equal).

As shown when I compute the counts with PySpark:

>>> total = ggrams.filter(ggrams.ngram.startswith("equal ")).groupby("ngram") \
...              .sum("match_count")

>>> total.sort("sum(match_count)", ascending=False).show(n=15)

+------------+----------------+  
|       ngram|sum(match_count)|  
+------------+----------------+  
|equal _NOUN_|        20130934|  
| equal _PRT_|        16620727|  
|    equal to|        16598291|  
|equal to_PRT|        16598291|  
|   equal _._|         5119672|  
| equal _ADP_|         3037747|  
|     equal ,|         2276119|  
|   equal ,_.|         2276119|  
|    equal in|         1682835|  
|equal in_ADP|         1682176|  
|     equal .|         1628257|  
|   equal ._.|         1628257|  
|equal _CONJ_|         1363739|  
|    ...     |             ...|  
+------------+----------------+  

So, to avoid counting the same bigram several times, my idea was instead to sum the counts of all patterns of the form "equal <POS>", where <POS> ranges over the documented PoS set [_PRT_, _NOUN_, ...] (findable here).
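
A minimal PySpark sketch of that idea, reusing the ggrams dataframe from above (the tag list is an assumption based on the documented set; adjust it to your copy of the corpus):

    from pyspark.sql import functions as F

    # Aggregated PoS placeholders; assumed from the documented tag set.
    POS_TAGS = ["_NOUN_", "_VERB_", "_ADJ_", "_ADV_", "_PRON_", "_DET_",
                "_ADP_", "_NUM_", "_CONJ_", "_PRT_"]

    # Keep only the aggregated rows, so each bigram is counted once.
    aggregated = ggrams.filter(
        ggrams.ngram.isin([f"equal {tag}" for tag in POS_TAGS]))
    total_count = aggregated.agg(F.sum("match_count")).first()[0]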

Doing this, I obtain sums that are about one third of what I would get from the dataframe displayed above, which strengthens my hypothesis that each bigram is counted three times. But I can't convince myself that this is the right way to do it, especially given these weird tokens ,_., ._. and _._, whose meanings I can't figure out.


Solution

  • The list of POS tags given in the documentation does not mention two of the tags, but the 2012 paper Syntactic Annotations for the Google Books Ngram Corpus does:

    • ‘.’ (punctuation marks)
    • X (a catch-all for other categories such as abbreviations or foreign words)

    So the token ,_. is a comma with its POS tag appended, just like the token run_VERB. Similarly, ._. is a full stop with its POS tag appended. Finally, _._ is the aggregated placeholder for any punctuation, just as _VERB_ stands for any verb.
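
    To illustrate the three token forms, here is a small Python sketch; the classify helper is hypothetical, purely for illustration:

        import re

        def classify(token):
            # _NOUN_, _._, ... : aggregated placeholder for a whole PoS class
            if re.fullmatch(r"_[A-Z.]+_", token):
                return "aggregated PoS placeholder"
            # run_VERB, ,_., ._. : a concrete token with its PoS tag appended
            if "_" in token:
                word, pos = token.rsplit("_", 1)
                return f"token {word!r} tagged {pos!r}"
            return "plain token"

        for t in ["run_VERB", ",_.", "._.", "_._", "_VERB_"]:
            print(t, "->", classify(t))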