
Python TextBlob Package - Determines POS tag for '%' symbol but does not print it as a word


I have been struggling with Python's TextBlob package, which

  • identifies sentences from paragraphs
  • identifies words from sentences
  • determines POS (part-of-speech) tags for those words, etc.

Everything was going well until I ran into what looks like an issue, unless I am mistaken. It is illustrated by the sample code below.

from textblob import TextBlob
sample = '''This is greater than that by 5%.''' # Sample sentence
blob = TextBlob(sample)                         # Wrap the sentence in a TextBlob
Words = blob.words                              # Split the sentence into words
Tags = blob.tags                                # Determine the POS tag for each word in the sentence

print(Tags)
[('This', 'DT'), ('is', 'VBZ'), ('greater', 'JJR'), ('than', 'IN'), ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]

print(Words)
['This', 'is', 'greater', 'than', 'that', 'by', '5']

As seen above, the blob.tags property treats the '%' symbol as a separate word and assigns it a POS tag.

The blob.words property, however, does not return the '%' symbol at all, either on its own or attached to the preceding word.

I am building a data frame from the output of both properties, and it fails to build because the two lists differ in length.

Here are my questions: Is this an issue in the TextBlob package? And is there any way to get '%' into the Words list?


Solution

  • Stripping off punctuation at tokenization seems to be a conscious decision by the TextBlob devs: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L624

    They rely on tokenizers built on top of NLTK, which take an include_punct parameter, but I don't see a way to pass include_punct=True through the words property down to the tokenizer.

    When faced with a similar issue, I replaced the punctuation I cared about with a non-dictionary text constant that represents it, e.g. replacing '%' with 'PUNCTPERCENT' before tokenizing. This way, the information that there was a percent symbol doesn't get lost.
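
A minimal sketch of that placeholder trick, using only the standard library. The helper names (protect_punctuation, restore_punctuation) and the PLACEHOLDERS mapping are made up for illustration and are not part of TextBlob; the idea would be to call protect_punctuation before handing the text to TextBlob and restore_punctuation on the resulting word list.

```python
import re

# Hypothetical placeholder names; any token unlikely to occur in real text works.
PLACEHOLDERS = {"%": "PUNCTPERCENT"}


def protect_punctuation(text, placeholders=PLACEHOLDERS):
    """Replace selected symbols with word-like placeholders, padded with spaces."""
    for symbol, name in placeholders.items():
        text = text.replace(symbol, f" {name} ")
    # Collapse the extra whitespace introduced by the padding.
    return re.sub(r"\s+", " ", text).strip()


def restore_punctuation(words, placeholders=PLACEHOLDERS):
    """Map placeholder words back to the symbols they stand for."""
    reverse = {name: symbol for symbol, name in placeholders.items()}
    return [reverse.get(word, word) for word in words]


protected = protect_punctuation("This is greater than that by 5%.")
restored = restore_punctuation(["by", "5", "PUNCTPERCENT"])
```

Whatever POS tag the tagger assigns to the placeholder word is not meaningful for the symbol itself, so you may want to overwrite it when restoring.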

    EDIT: I stand corrected. When initializing TextBlob you can set a tokenizer through the tokenizer argument of its __init__ method: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L328

    So you could easily pass TextBlob a tokenizer that respects punctuation.

    respectful_tokenizer = YourCustomTokenizerRespectsPunctuation()  # placeholder for your own tokenizer class
    blob = TextBlob('some text with %', tokenizer=respectful_tokenizer)
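
As a rough idea of what such a tokenizer could look like, here is a regex-based sketch that keeps punctuation marks as separate tokens. The class name is hypothetical and the code uses only the standard library; TextBlob may require the object to derive from its own base tokenizer class (or an NLTK tokenizer), so treat that part as an assumption to check against the version you use.

```python
import re


class PunctuationKeepingTokenizer:
    """Hypothetical tokenizer: words and punctuation marks both become tokens."""

    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    _pattern = re.compile(r"\w+|[^\w\s]")

    def tokenize(self, text):
        return self._pattern.findall(text)


tokens = PunctuationKeepingTokenizer().tokenize("This is greater than that by 5%.")
```

With this pattern, '5%' is split into '5' and '%', and the final period becomes a token of its own.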
    

    EDIT2: I ran into this while looking at TextBlob's source: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L372 Notice the docstring of the words property: it says you should access the tokens property instead of words if you want to include punctuation.
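
For the data frame in the question, one way to avoid the length mismatch is to pair each token with its tag and leave the tag empty where the tagger skipped a token. The sketch below uses literal lists as stand-ins: the tags are copied from the question, while the token list is what I would expect blob.tokens to return for that sentence (an assumption, not captured output). The helper name align_tokens_with_tags is made up for illustration.

```python
def align_tokens_with_tags(tokens, tags):
    """Pair every token with its POS tag, or None when the tagger skipped it."""
    rows = []
    next_tag = 0  # index of the next unused (word, tag) pair
    for token in tokens:
        if next_tag < len(tags) and tags[next_tag][0] == token:
            rows.append((token, tags[next_tag][1]))
            next_tag += 1
        else:
            rows.append((token, None))
    return rows


# Stand-ins for blob.tokens (assumed) and blob.tags (from the question).
tokens = ['This', 'is', 'greater', 'than', 'that', 'by', '5', '%', '.']
tags = [('This', 'DT'), ('is', 'VBZ'), ('greater', 'JJR'), ('than', 'IN'),
        ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]

rows = align_tokens_with_tags(tokens, tags)
```

Each row is a (token, tag) pair, so the two data frame columns always have the same length.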