Is there a way to tell TextBlob not to split contractions like let's into let & 's when creating ngrams? I know they are technically two separate words, but I'd like to maintain them as one.
It looks like you've got two options here:
1. Change the pattern the tokenizer uses.
2. Post-process the tokens TextBlob gives you.
The latter is easier, but slower.
Changing the Pattern
TextBlob accepts nltk tokenizers, and I'm more familiar with those, so that's what we're going to use. nltk's WordPunctTokenizer is a RegexpTokenizer with the pattern "\\w+|[^\\w\\s]+":
>>> nltk.tokenize.RegexpTokenizer("\\w+|[^\\w\\s]+").tokenize("Let's check this out.")
['Let', "'", 's', 'check', 'this', 'out', '.']
Before the disjunction is \w+, which matches runs of word characters. After the disjunction is [^\w\s]+, which matches anything that's not a word character or whitespace, that is, punctuation.
If you want to include ' in words, to get "let's", then you can just add that character to the word character portion of the disjunction:
>>> nltk.tokenize.RegexpTokenizer("[\\w']+|[^\\w\\s]+").tokenize("Let's check this out.")
["Let's", 'check', 'this', 'out', '.']
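To get from those tokens to ngrams, here's a minimal sketch that uses only the standard library's re module with the same pattern. The tokenize and ngrams helpers are names I made up for illustration, not TextBlob or nltk functions; for this input, re.findall should give the same tokens as the RegexpTokenizer call above.

```python
import re

# Same pattern as above: word characters plus apostrophes, or runs of punctuation.
TOKEN_PATTERN = re.compile(r"[\w']+|[^\w\s]+")

def tokenize(text):
    """Split text into tokens, keeping apostrophes inside words (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)

def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Let's check this out.")
print(ngrams(tokens, 2))
# [("Let's", 'check'), ('check', 'this'), ('this', 'out'), ('out', '.')]

# Side effect of treating ' as a word character: quote marks stick to quoted words.
print(tokenize("She said 'hi'."))
# ['She', 'said', "'hi'", '.']
```

The contraction stays in one piece in the bigrams, but the second call shows the price: any single quote next to a word gets pulled into that word's token.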
Post-Processing
The regex approach isn't perfect, though. I suspect TextBlob's built-in tokenizer might be a bit better than what we could hack together with a regex. If you strictly want to take contractions as one token, I recommend just post-processing TextBlob's output.
>>> tokens = ["Let", "'s", "check", "this", "out", "."]
>>> def postproc(toks):
...     toks_out = []
...     while len(toks) > 1:
...         bigram = toks[:2]
...         # If the second token starts with an apostrophe, glue the pair together.
...         if bigram[1][0] == "'":
...             toks_out.append("".join(bigram))
...             toks = toks[2:]
...         else:
...             toks_out.append(bigram[0])
...             toks = toks[1:]
...     # Keep whatever single token is left over.
...     toks_out.extend(toks)
...     return toks_out
...
>>> postproc(tokens)
["Let's", 'check', 'this', 'out', '.']
That fixes exactly what you want fixed, but the extra post-processing pass does add run time to your code.
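If the run time matters, a single pass that avoids re-slicing the list is one way to cut it down. This is only a sketch, not a drop-in replacement for postproc: merge_contractions is a hypothetical name, it also glues on "n't" (assuming a Treebank-style tokenizer that splits "don't" into "do" and "n't"), and unlike postproc it leaves a lone ' alone.

```python
def merge_contractions(tokens):
    """Glue clitic tokens such as "'s" or "n't" back onto the preceding token."""
    merged = []
    for tok in tokens:
        # A clitic starts with an apostrophe and has more after it ("'s", "'re"),
        # or is the negation "n't" (an assumption about how the tokenizer splits).
        is_clitic = (tok.startswith("'") and len(tok) > 1) or tok == "n't"
        if merged and is_clitic:
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged

print(merge_contractions(["Let", "'s", "check", "this", "out", "."]))
# ["Let's", 'check', 'this', 'out', '.']
```

Each token is looked at once and nothing is copied, so the cost grows linearly with the number of tokens.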