
How to tokenize a sentence with known biwords using nltk?


I'm doing a text analytics task in Python, using NLTK for the text processing. I have a predefined set of multi-word expressions ("biwords"), shown below.

arr = ['Animo Text Analytics Inc.', 'Amila Iddamalgoda']

And also I have a sentence like below.

sentence = "Amila Iddamalgoda is currently working for Animo Text Analytics Inc. and currently following the Text Mining and Analytics course provided by coursera."

Now I have tokenized this with NLTK.

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(sentence)

This gives single-word tokens (obviously). However, what I need is to match the predefined set of biwords (mentioned at the beginning) and treat each biword phrase as a single token.

e.g.: Amila Iddamalgoda, currently, working, Animo Text Analytics Inc., following, ...

How can I accomplish this? Please help me out.


Solution

  • Replace all spaces in each occurrence of a multi-word in your text with some clearly recognizable character, e.g., an underscore:

    import re

    for expr in arr:
        # re.escape keeps the "." in "Inc." from matching any character
        sentence = re.sub(re.escape(expr), re.sub(r'\s+', "_", expr), sentence)
    #'Amila_Iddamalgoda is currently working ...'
    

    You can do "normal" tokenization now.
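For instance, after the replacement step the tokenizer from the question keeps each merged phrase together, because underscores count as word characters in `\w` (a sketch using the question's `arr` and `sentence`):

```python
import re
from nltk.tokenize import RegexpTokenizer

arr = ['Animo Text Analytics Inc.', 'Amila Iddamalgoda']
sentence = ("Amila Iddamalgoda is currently working for "
            "Animo Text Analytics Inc. and currently following the "
            "Text Mining and Analytics course provided by coursera.")

# merge each known phrase into one "word";
# re.escape neutralises the "." in "Inc."
for expr in arr:
    sentence = re.sub(re.escape(expr), re.sub(r'\s+', '_', expr), sentence)

tokens = RegexpTokenizer(r'\w+').tokenize(sentence)
# 'Amila_Iddamalgoda' and 'Animo_Text_Analytics_Inc' are now single tokens
# (note that the trailing "." of "Inc." is dropped by \w+)
```

If you need the original phrases back afterwards, you can reverse the substitution on each token with `token.replace('_', ' ')`.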

    If you suspect that there is more than one space between words in the text, first create a dictionary that maps a whitespace-tolerant regular expression for each multi-word to its replacement:

    toreplace = {r'\s+'.join(map(re.escape, a.split())): '_'.join(a.split())
                 for a in arr}
    #{'Amila\\s+Iddamalgoda': 'Amila_Iddamalgoda',
    # 'Animo\\s+Text\\s+Analytics\\s+Inc\\.': 'Animo_Text_Analytics_Inc.'}
    

    Now, apply each replacement pattern to the original sentence:

    for pattern, repl in toreplace.items():
        sentence = re.sub(pattern, repl, sentence)
    

    Now, again, you can do "normal" tokenization.

    The proposed solution is quite inefficient: it rescans the whole text once per multi-word. If efficiency is important, you can write your own tokenizing regular expression and use nltk.regexp_tokenize().
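A single-pass sketch of that idea, assuming the `arr` and `sentence` from the question: put the escaped multi-word expressions first in the alternation so they win over the plain `\w+` fallback.

```python
import re
from nltk.tokenize import regexp_tokenize

arr = ['Animo Text Analytics Inc.', 'Amila Iddamalgoda']
sentence = ("Amila Iddamalgoda is currently working for "
            "Animo Text Analytics Inc. and currently following the "
            "Text Mining and Analytics course provided by coursera.")

# known multi-word phrases first, single words as a fallback
pattern = '|'.join(re.escape(a) for a in arr) + r'|\w+'
tokens = regexp_tokenize(sentence, pattern)
# ['Amila Iddamalgoda', 'is', 'currently', 'working', 'for',
#  'Animo Text Analytics Inc.', 'and', ...]
```

This tokenizes the text in one pass and keeps the phrases exactly as written, including the "." in "Inc.", with no underscore round-trip needed.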