python nlp textblob

Preserving contractions with textblob ngrams


Is there a way to tell TextBlob not to split contractions like "let's" into "let" and "'s" when creating n-grams? I know they are technically two separate words, but I'd like to keep them as one token.


Solution

  • It looks like you've got two options here:

      1. Change the pattern the tokenizer uses.
      2. Post-process the tokens after tokenizing.

    The latter is easier, but slower.

    Changing the Pattern

    TextBlob accepts nltk tokenizers, and I'm more familiar with those, so that's what we'll use. nltk's WordPunctTokenizer is a RegexpTokenizer with the pattern "\\w+|[^\\w\\s]+":

    >>> import nltk
    >>> nltk.tokenize.RegexpTokenizer("\\w+|[^\\w\\s]+").tokenize("Let's check this out.")
    ['Let', "'", 's', 'check', 'this', 'out', '.']
    

    Before the disjunction is \w+, which matches a run of one or more word characters. After the disjunction is [^\w\s]+, which matches a run of anything that's neither a word character nor whitespace, that is, punctuation.

    If you want to include ' in words, to get "let's", then you can just add that character to the word character portion of the disjunction:

    >>> nltk.tokenize.RegexpTokenizer("[\\w']+|[^\\w\\s]+").tokenize("Let's check this out.")
    ["Let's", 'check', 'this', 'out', '.']
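
    To see what the modified pattern does to n-grams, here is a minimal sketch that uses only the standard library. It assumes re.findall with the same pattern behaves like the RegexpTokenizer call above for this input; the tokenize and ngrams helpers are hypothetical names for illustration, not part of TextBlob or nltk.

```python
import re

# Same pattern as the modified tokenizer above: the apostrophe is treated
# as a word character, so "Let's" stays in one piece.
CONTRACTION_PATTERN = re.compile(r"[\w']+|[^\w\s]+")


def tokenize(text):
    """Hypothetical helper: return every non-overlapping match of the pattern."""
    return CONTRACTION_PATTERN.findall(text)


def ngrams(tokens, n=2):
    """Hypothetical helper: slide a window of size n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("Let's check this out.")
bigrams = ngrams(tokens, 2)
print(bigrams)
```

    With the contraction kept whole, the first bigram pairs "Let's" with "check" instead of pairing "Let" with an apostrophe fragment.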
    

    Post-Processing

    The regex approach isn't perfect, though: for example, it also keeps apostrophes that are really single quotation marks attached to the neighboring word. I suspect TextBlob's built-in tokenizer might be a bit better than what we could hack together with a regex. If you strictly want to treat contractions as one token, I recommend just post-processing TextBlob's output.

    >>> tokens = ["Let", "'s", "check", "this", "out", "."]
    >>> def postproc(toks):
    ...     # Walk the list pairwise, gluing an apostrophe-led token onto its predecessor.
    ...     toks_out = []
    ...     while len(toks) > 1:
    ...         bigram = toks[:2]
    ...         if bigram[1].startswith("'"):
    ...             toks_out.append("".join(bigram))
    ...             toks = toks[2:]
    ...         else:
    ...             toks_out.append(bigram[0])
    ...             toks = toks[1:]
    ...     toks_out.extend(toks)  # at most one token is left over
    ...     return toks_out
    ... 
    >>> postproc(tokens)
    ["Let's", 'check', 'this', 'out', '.']
    

    That merges exactly the tokens you want merged, but the extra post-processing pass does add run time to your code.
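
    If the run time matters, part of the cost in postproc comes from re-slicing the list on every step, which copies it each time. Here is a rough single-pass sketch that avoids that. The name merge_contractions is hypothetical, and handling "n't" is an added assumption: Treebank-style tokenizers typically split "don't" into "do" and "n't", which the apostrophe-prefix check alone would miss. Unlike postproc, it also chains consecutive apostrophe-led tokens onto the same word.

```python
def merge_contractions(tokens):
    """Attach contraction fragments (e.g. "'s", "n't") to the preceding token."""
    merged = []
    for tok in tokens:
        # Assumption: fragments either start with an apostrophe or are
        # exactly "n't"; anything else is treated as a new token.
        is_fragment = tok.startswith("'") or tok == "n't"
        if is_fragment and merged:
            merged[-1] += tok  # glue onto the previous token in place
        else:
            merged.append(tok)
    return merged


merged = merge_contractions(["Let", "'s", "check", "this", "out", "."])
negated = merge_contractions(["do", "n't", "go"])
print(merged)
print(negated)
```

    The merged list can then be fed to whatever n-gram step you already use.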