Tags: python, tokenize, data-cleaning

Why do I get several lists when tokenizing in Python?


I am doing a data-cleaning task in Python, reading from a text file that contains several sentences. After tokenizing the file I get a separate list of tokens for each sentence, like this:

[u'does', u'anyone', u'think', u'that', u'we', u'have', u'forgotten', u'the', u'days', u'of', u'favours', u'for', u'the', u'pn', u's', u'party', u's', u'friends', u'of', u'friends', u'and', u'paymasters', u'some', u'of', u'us', u'have', u'longer', u'memories']

[u'but', u'is', u'the', u'value', u'at', u'which', u'vassallo', u'brothers', u'bought', u'this', u'property', u'actually', u'relevant', u'and', u'represents', u'the', u'actual', u'value', u'of', u'the', u'property']

[u'these', u'monsters', u'are', u'wrecking', u'the', u'reef', u'the', u'cargo', u'vessels', u'have', u'been', u'there', u'for', u'weeks', u'and', u'the', u'passenger', u'ship', u'for', u'at', u'least', u'24', u'hours', u'now', u'https', u'uploads', u'disquscdn', u'com']

The code I am using is the following:

with open(file_path) as fp:
    comments = fp.readlines()

    for i in range (0, len(comments)):

        tokens = tokenizer.tokenize(no_html.lower())
        print tokens

Where no_html is the text of the file with the HTML tags removed. Could anyone tell me how to get all of these tokens into one list?


Solution

  • Instead of using comments = fp.readlines(), try comments = fp.read() (see the sketch after the code below).

    readlines() reads every line of the file and returns them as a list, one string per line, which is why each sentence ends up tokenized into its own list; read() returns the whole file as a single string.

    Alternatively, you can collect all of the tokenized results into a single list:

    all_tokens = []
    for comment in comments:
        # tokenize each comment (with its HTML already stripped, as in your code)
        tokens = tokenizer.tokenize(comment.lower())
        all_tokens.extend(tokens)

    print all_tokens
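
    For the first suggestion, here is a minimal sketch of the read() approach. It assumes your tokenizer behaves like NLTK's RegexpTokenizer (any tokenizer with a tokenize() method will do) and that the HTML has already been stripped, as in your original code:

    from nltk.tokenize import RegexpTokenizer

    tokenizer = RegexpTokenizer(r'\w+')  # assumption: a simple word tokenizer like yours

    with open(file_path) as fp:
        text = fp.read()  # the whole file as one string, not a list of lines

    tokens = tokenizer.tokenize(text.lower())
    print tokens  # a single flat list of tokens for the whole file

    Either way, note that extend() is the right call when merging: append(tokens) would give you a list of lists again, which is exactly what you are trying to avoid.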