
Expanding English contractions in Python, based on a dictionary of the most common contractions


I'm trying to substitute contracted words using Python, but am facing errors.

import re
tweet = "I luv my <3 iphone & you're awsm apple. DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com"
contractions_dict = {"ain't": "am not",
                  "aren't": "are not",
                  "can't": "cannot",
                  "you're": "you are"}    

contractions_re = re.compile('(%s)' '|'.join(contractions_dict.keys()))

def expand_contractions(s, contractions_dict=contractions_dict):
    def replace(match):
        return contractions_dict[match.group(0)]

    return contractions_re.sub(replace, s)

expand_contractions(tweet)

I've tried adding a "/" inside "you're", but that didn't work.

The output is supposed to be the expanded version, but instead the original tweet is just passed through.


Solution

  • Here's a clue:

    >>> print('(%s)' '|'.join(contractions_dict.keys()))
    you're(%s)|aren't(%s)|ain't(%s)|can't
    

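    Python concatenates adjacent string literals, so '(%s)' '|' is just the separator string '(%s)|', and nothing is ever substituted for %s. Since %s has no particular meaning in a regex, it will simply match itself. But there is no percent sign in your input, so the match fails.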

    I suspect that you were looking for something like

    >>> print('|'.join('(%s)' % k for k in contractions_dict.keys()))
    (you're)|(aren't)|(ain't)|(can't)
    

    Or perhaps

    >>> print('(%s)' % '|'.join(contractions_dict.keys()))
    (you're|aren't|ain't|can't)
    

    But since you are using match.group(0) (i.e., the whole matched string) the captures are irrelevant, and there is no need to parenthesize the words in an alternation. So the simpler solution is fine:

    >>> contractions_re = re.compile('|'.join(contractions_dict.keys()))
    >>> expand_contractions(tweet)
    'I luv my <3 iphone & you are awsm apple. DisplayIsAwesome, sooo happppppy \xf0\x9f\x99\x82 http://www.apple.com'
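
If the dictionary grows, the plain join above can become fragile: a key containing a regex metacharacter would be interpreted as a pattern, and a short key could match before a longer one that starts the same way. The following is a minimal sketch of one way to guard against that, not part of the original answer. It assumes that escaping each key with re.escape and trying longer keys first is acceptable for your data; the sample sentence is made up for illustration.

```python
import re

contractions_dict = {
    "ain't": "am not",
    "aren't": "are not",
    "can't": "cannot",
    "you're": "you are",
}

# Assumption: keys are literal strings, so escape them, and put longer
# keys first so a longer contraction wins over a shorter prefix.
contractions_re = re.compile(
    "|".join(
        re.escape(key)
        for key in sorted(contractions_dict, key=len, reverse=True)
    )
)

def expand_contractions(s, contractions_dict=contractions_dict):
    def replace(match):
        # group(0) is the whole matched contraction, used as the dict key
        return contractions_dict[match.group(0)]

    return contractions_re.sub(replace, s)

print(expand_contractions("you're sure it can't fail"))
```

This sketch is still case-sensitive, so "You're" at the start of a sentence would be left alone unless the dictionary also contains that spelling.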