Tags: python, python-3.x, list, nltk, tokenize

Tokenizing words in Python 3 while preserving terms that contain arithmetic and logical operators?


While tokenizing multiple sentences from a large corpus, I need to preserve certain words in their original form, such as .Net, C#, and C++. I also want to remove punctuation marks (.,!_-()=*&^%$@~ etc.), but I need to keep words like .net, .htaccess, .htpassword, and c++ intact.

I have tried both nltk.word_tokenize and nltk.regexp_tokenize, but I am not getting the expected output.
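For example (a minimal check, assuming NLTK's punkt models are installed), nltk.word_tokenize splits the operators off into separate tokens:

import nltk

print(nltk.word_tokenize('C# and C++ are languages.'))
# typically: ['C', '#', 'and', 'C', '+', '+', 'are', 'languages', '.']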

Please help me fix this issue.

The code:

import nltk
from nltk import regexp_tokenize
from nltk.corpus import stopwords


def pre_data(tokenized_raw_data):
    tokenized_sentences = nltk.sent_tokenize(tokenized_raw_data)
    # stopword list (built here but not yet applied to the tokens)
    sw0 = stopwords.words('english')
    sw1 = ["i.e", "dxint", "hrangle", "idoteq", "devs", "zero"]
    sw = sw0 + sw1
    # split each sentence on whitespace, digits, and punctuation,
    # trying to keep '.', '+', and '#' attached to words
    tokens = [[token for token in regexp_tokenize(sentence, pattern=r"\s|\d|[^.+#\w a-z]", gaps=True)]
              for sentence in tokenized_sentences]
    print(tokens)

pre_data(tokenized_raw_data)  # tokenized_raw_data is read from a file (see below)

tokenized_raw_data is plain text read from a file. It contains multiple sentences separated by whitespace and includes words like .blog, .net, c++, c#, asp.net, and .htaccess.
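For reference, a minimal sketch of how such a file might be loaded (the filename corpus.txt is a placeholder, not from the question):

# hypothetical path; substitute your actual corpus file
with open('corpus.txt', encoding='utf-8') as f:
    tokenized_raw_data = f.read()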

Example

['.blog is a generic top-level domain intended for use by blogs.',
 'C# is a general-purpose, multi-paradigm programming language.',
 'C++ is object-oriented programming language.']


Solution

  • This solution covers the given examples and preserves words like C++, C#, asp.net, and so on, while removing ordinary punctuation.

    import nltk
    
    paragraph = (
            '.blog is a generic top-level domain intended for use by blogs. '
            'C# is a general-purpose, multi-paradigm programming language. '
            'C++ is object-oriented programming language. '
            'asp.net is something very strange. '
            'The most fascinating language is c#. '
            '.htaccess makes my day!'
    )
    
    def pre_data(raw_data):
        tokenized_sentences = nltk.sent_tokenize(raw_data)
        # optional leading word chars, an optional dot, more word chars, then any trailing '#' or '+'
        tokens = [nltk.regexp_tokenize(sentence, pattern=r'\w*\.?\w+[#+]*')
                  for sentence in tokenized_sentences]
        return tokens
    
    tokenized_data = pre_data(paragraph)
    print(tokenized_data)
    

    Out

    [['.blog', 'is', 'a', 'generic', 'top', 'level', 'domain', 'intended', 'for', 'use', 'by', 'blogs'], 
     ['C#', 'is', 'a', 'general', 'purpose', 'multi', 'paradigm', 'programming', 'language'], 
     ['C++', 'is', 'object', 'oriented', 'programming', 'language'], 
     ['asp.net', 'is', 'something', 'very', 'strange'], 
     ['The', 'most', 'fascinating', 'language', 'is', 'c#'], 
     ['.htaccess', 'makes', 'my', 'day']]
    

    However, this simple regular expression will probably not cover every technical term in your texts. If you can share fuller examples from your corpus, a more general solution can be worked out.
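    If the terms to protect are known ahead of time, one option is to build the tokenizer pattern from an explicit whitelist, so exact terms are matched before the generic word pattern. The sketch below is only an illustration, not part of the original answer: the PROTECTED list and the build_pattern helper are hypothetical names.

    import re
    import nltk

    # illustrative whitelist; extend it with the terms found in your corpus
    PROTECTED = ['asp.net', '.htaccess', '.htpassword', '.blog', '.net', 'c++', 'c#']

    def build_pattern(protected):
        # sort longest-first so 'asp.net' is tried before '.net'
        terms = sorted(protected, key=len, reverse=True)
        alternation = '|'.join(re.escape(t) for t in terms)
        # protected terms first (case-insensitive), then the generic word pattern
        return r'(?i:%s)|\w*\.?\w+[#+]*' % alternation

    def tokenize(raw_data, protected=PROTECTED):
        pattern = build_pattern(protected)
        return [nltk.regexp_tokenize(sentence, pattern=pattern)
                for sentence in nltk.sent_tokenize(raw_data)]

    print(tokenize('ASP.NET and .htaccess work. C++ too!'))
    # e.g. [['ASP.NET', 'and', '.htaccess', 'work'], ['C++', 'too']]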