Preserving contractions with textblob ngrams

Is there a way to tell TextBlob not to split contractions like let's into let & 's when creating ngrams? I know they are technically two separate words, but I'd like to keep them as one.

Solution

It looks like you've got two options here:

Change the tokenizer used in TextBlob.
Post-process the tokens.

The latter is easier, but slower.

Changing the Pattern

TextBlob accepts nltk tokenizers, and I'm more familiar with those, so we'll use one of them. nltk's WordPunctTokenizer is a RegexpTokenizer with the pattern "\\w+|[^\\w\\s]+":

>>> import nltk
>>> nltk.tokenize.RegexpTokenizer("\\w+|[^\\w\\s]+").tokenize("Let's check this out.")
['Let', "'", 's', 'check', 'this', 'out', '.']

Before the disjunction is \w+, which matches a run of word characters. After the disjunction is [^\w\s]+, which matches a run of anything that's neither a word character nor whitespace--that is, punctuation.

If you want to include ' in words, to get "let's", then you can just add that character to the word character portion of the disjunction:

>>> nltk.tokenize.RegexpTokenizer("[\\w']+|[^\\w\\s]+").tokenize("Let's check this out.")
["Let's", 'check', 'this', 'out', '.']
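To see what this pattern does to the ngrams themselves, here is a minimal sketch that uses only Python's re module instead of TextBlob or nltk. The names tokenize and ngrams are my own helpers, not library functions, and I'm assuming re.findall with this pattern yields the same tokens as RegexpTokenizer, which should hold for a pattern without capturing groups.

```python
import re

# Same pattern as above: runs of word characters and apostrophes,
# or runs of punctuation.
PATTERN = re.compile(r"[\w']+|[^\w\s]+")


def tokenize(text):
    """Hypothetical helper: split text into tokens with the pattern above."""
    return PATTERN.findall(text)


def ngrams(tokens, n=2):
    """Hypothetical helper: return consecutive n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("Let's check this out.")
print(tokens)
# ["Let's", 'check', 'this', 'out', '.']
print(ngrams(tokens, 2))
# [("Let's", 'check'), ('check', 'this'), ('this', 'out'), ('out', '.')]

# The apostrophe is now treated as a word character everywhere,
# so single-quoted words keep their quotes.
print(tokenize("She said 'hi'"))
# ['She', 'said', "'hi'"]
```

The last call shows one side effect of the widened character class: quotation marks written with ' get glued onto the word they surround.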

Post-Processing

The regex approach isn't perfect, though. I suspect TextBlob's built-in tokenizer might be a bit better than what we could hack together with a regex. If you strictly want to take contractions as one token, I recommend just post-processing TextBlob's output.

>>> tokens = ["Let", "'s", "check", "this", "out", "."]
>>> def postproc(toks):
...     # Merge any token that starts with an apostrophe into the token before it.
...     toks_out = []
...     while len(toks) > 1:
...         bigram = toks[:2]
...         if bigram[1].startswith("'"):
...             toks_out.append("".join(bigram))
...             toks = toks[2:]
...         else:
...             toks_out.append(bigram[0])
...             toks = toks[1:]
...     toks_out.extend(toks)  # keep the last token if one is left over
...     return toks_out
... 
>>> postproc(tokens)
["Let's", 'check', 'this', 'out', '.']

That fixes exactly what you want fixed, but the post-processing step does add run time to your code.
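If the extra run time is a concern, much of it comes from re-slicing the list on every step. Here is a rough single-pass sketch of the same idea. The name merge_contractions and the suffix_starts parameter are my own, and handling n't is an assumption on my part: it only matters if your tokenizer emits n't as a separate token, the way Treebank-style tokenizers do.

```python
def merge_contractions(tokens, suffix_starts=("'", "n't")):
    """Glue contraction suffixes such as 's or n't onto the preceding token.

    Hypothetical helper; suffix_starts lists the prefixes that mark a token
    as a contraction suffix (an assumption about the tokenizer's output).
    """
    merged = []
    for tok in tokens:
        # Only merge when there is a previous token to attach to.
        if merged and tok.startswith(suffix_starts):
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged


print(merge_contractions(["Let", "'s", "check", "this", "out", "."]))
# ["Let's", 'check', 'this', 'out', '.']
print(merge_contractions(["I", "do", "n't", "know"]))
# ['I', "don't", 'know']
```

This visits each token once and never copies the list. Unlike postproc, it will chain several suffix tokens in a row onto the same word, which may or may not be what you want.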