
How to remove List special characters ("()", "'",",") from the output of a bi / tri-gram in Python


I have written code that calculates bigram / trigram frequencies from a text input, using NLTK. The problem I am facing is that, since the output is obtained in the form of a Python list, it contains list-specific characters, i.e. ("()", "'", ","). I plan to export this to a CSV file, so I would like to remove these special characters in the code itself. How can I do that?

Input Code:

import nltk
from nltk import word_tokenize, pos_tag
from nltk.collocations import *
from itertools import *
from nltk.util import ngrams
from nltk.corpus import stopwords

corpus = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while  looked so far into her that, for a while looked so far into her that, for a while the visual 
held no memory. Little by little, I returned to myself, waking to nurse the visual held no  memory. Little by little, I returned to myself, waking to nurse
'''
s_corpus = corpus.lower()

stop_words = set(stopwords.words('english'))

tokens = nltk.word_tokenize(s_corpus)
tokens = [word for word in tokens if word not in stop_words]

c_tokens = [''.join(e for e in string if e.isalnum()) for string in tokens]
c_tokens = [x for x in c_tokens if x]

bgs_2 = nltk.bigrams(c_tokens)
bgs_3 = nltk.trigrams(c_tokens)

fdist = nltk.FreqDist(bgs_3)

tmp = list()
for k,v in fdist.items():
    tmp.append((v,k))
tmp = sorted (tmp, reverse=True)

for kk,vv in tmp[:]:
    print (vv,kk)

Current Output:

('looked', 'far', 'looked') 3
('far', 'looked', 'far') 3
('visual', 'held', 'memory') 2
('returned', 'waking', 'nurse') 2

Expected Output:

looked far looked, 3
far looked far, 3
visual held memory, 2
returned waking nurse, 2

Thanks for your help in advance.


Solution

  • A better question would have been: what are those ("()", "'", ",") in the ngrams output?

    >>> from nltk import ngrams
    >>> from nltk import word_tokenize
    
    # Split a sentence into a list of "words"
    >>> word_tokenize("This is a foo bar sentence")
    ['This', 'is', 'a', 'foo', 'bar', 'sentence']
    >>> type(word_tokenize("This is a foo bar sentence"))
    <class 'list'>
    
    # Extract bigrams.
    >>> list(ngrams(word_tokenize("This is a foo bar sentence"), 2))
    [('This', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar'), ('bar', 'sentence')]
    
    # Okay, so the output is a list, no surprise.
    >>> type(list(ngrams(word_tokenize("This is a foo bar sentence"), 2)))
    <class 'list'>
    

    But what type is ('This', 'is')?

    >>> list(ngrams(word_tokenize("This is a foo bar sentence"), 2))[0]
    ('This', 'is')
    >>> first_thing_in_output = list(ngrams(word_tokenize("This is a foo bar sentence"), 2))[0]
    >>> type(first_thing_in_output)
    <class 'tuple'>
    

    Ah, it's a tuple, see https://realpython.com/python-lists-tuples/

    What happens when you print a tuple?

    >>> print(first_thing_in_output)
    ('This', 'is')
    

    What happens if you convert it into a string with str()?

    >>> print(str(first_thing_in_output))
    ('This', 'is')
    

    But I want the output This is instead of ('This', 'is'), so I will use the str.join() function, see https://www.geeksforgeeks.org/join-function-python/:

    >>> print(' '.join((first_thing_in_output)))
    This is
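
Applied to the output format asked for in the question, the same str.join() idea can be sketched as below. The (ngram, count) pairs and the format_row helper are made up for illustration; they only mimic the shape of the items you get from a FreqDist.

```python
# Hypothetical (ngram, count) pairs, shaped like the items of an nltk.FreqDist.
counts = [
    (('looked', 'far', 'looked'), 3),
    (('visual', 'held', 'memory'), 2),
]

def format_row(ngram, count):
    """Join the words of an n-gram with spaces and append its count."""
    return '{}, {}'.format(' '.join(ngram), count)

lines = [format_row(ngram, count) for ngram, count in counts]
for line in lines:
    print(line)
```

The tuple is never converted with str(), so the parentheses, quotes and commas of its printed representation never appear.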
    

    Now this is a good point to really go through the tutorial of basic Python types to understand what is happening. Additionally, it'll be good to understand how "container" types work too, e.g. https://github.com/usaarhat/pywarmups/blob/master/session2.md


    Going through the original post, there are quite a few issues with the code.

    I guess the goal of the code is to:

    • Tokenize the text and remove stopwords
    • Extract ngrams (without stopwords)
    • Print out their string forms and their counts

    The tricky part is that stopwords.words('english') does not contain punctuation, so you'll end up with strange ngrams that contain punctuation tokens:

    from nltk import word_tokenize
    from nltk.util import ngrams
    from nltk.corpus import stopwords
    
    text = '''The pure amnesia of her face,
    newborn. I looked so far into her that, for a while, looked so far into her that, for a while  looked so far into her that, for a while looked so far into her that, for a while the visual 
    held no memory. Little by little, I returned to myself, waking to nurse the visual held no  memory. Little by little, I returned to myself, waking to nurse
    '''
    
    stoplist = set(stopwords.words('english'))
    
    tokens = [token for token in word_tokenize(text) if token not in stoplist]
    
    list(ngrams(tokens, 2))
    

    [out]:

    [('The', 'pure'),
     ('pure', 'amnesia'),
     ('amnesia', 'face'),
     ('face', ','),
     (',', 'newborn'),
     ('newborn', '.'),
     ('.', 'I'),
     ('I', 'looked'),
     ('looked', 'far'),
     ('far', ','),
     (',', ','), ...]
    

    Perhaps you would like to extend the stoplist with punctuation, e.g.

    from string import punctuation
    from nltk import word_tokenize
    from nltk.util import ngrams
    from nltk.corpus import stopwords
    
    text = '''The pure amnesia of her face,
    newborn. I looked so far into her that, for a while, looked so far into her that, for a while  looked so far into her that, for a while looked so far into her that, for a while the visual 
    held no memory. Little by little, I returned to myself, waking to nurse the visual held no  memory. Little by little, I returned to myself, waking to nurse
    '''
    
    stoplist = set(stopwords.words('english') + list(punctuation))
    
    tokens = [token for token in word_tokenize(text) if token not in stoplist]
    
    list(ngrams(tokens, 2))
    

    [out]:

    [('The', 'pure'),
     ('pure', 'amnesia'),
     ('amnesia', 'face'),
     ('face', 'newborn'),
     ('newborn', 'I'),
     ('I', 'looked'),
     ('looked', 'far'),
     ('far', 'looked'),
     ('looked', 'far'), ...]
    

    Then you realize that tokens like I should be stopwords but still exist in your list of ngrams. That's because the words in stopwords.words('english') are lowercased, e.g.

    >>> stopwords.words('english')
    

    [out]:

    ['i',
     'me',
     'my',
     'myself',
     'we',
     'our',
     'ours',
     'ourselves',
     'you',
     "you're", ...]
    

    So when you're checking whether a token is in the stoplist, you should also lowercase the token. (Avoid lowercasing the sentence before word_tokenize because word_tokenize may take cues from capitalization). Thus:

    from string import punctuation
    from nltk import word_tokenize
    from nltk.util import ngrams
    from nltk.corpus import stopwords
    
    text = '''The pure amnesia of her face,
    newborn. I looked so far into her that, for a while, looked so far into her that, for a while  looked so far into her that, for a while looked so far into her that, for a while the visual 
    held no memory. Little by little, I returned to myself, waking to nurse the visual held no  memory. Little by little, I returned to myself, waking to nurse
    '''
    
    stoplist = set(stopwords.words('english') + list(punctuation))
    
    tokens = [token for token in word_tokenize(text) if token.lower() not in stoplist]
    
    list(ngrams(tokens, 2))
    

    [out]:

    [('pure', 'amnesia'),
     ('amnesia', 'face'),
     ('face', 'newborn'),
     ('newborn', 'looked'),
     ('looked', 'far'),
     ('far', 'looked'),
     ('looked', 'far'),
     ('far', 'looked'),
     ('looked', 'far'),
     ('far', 'looked'), ...]
    

    Now the ngrams look like they achieve the objectives:

    • Tokenize the text and remove stopwords
    • Extract ngrams (without stopwords)

    Then, for the last part, where you want to print the ngrams to a file in sorted order, you could use FreqDist.most_common(), which lists the items in descending order of count, e.g.

    from string import punctuation
    from nltk import word_tokenize
    from nltk.util import ngrams
    from nltk.corpus import stopwords
    from nltk import FreqDist
    
    text = '''The pure amnesia of her face,
    newborn. I looked so far into her that, for a while, looked so far into her that, for a while  looked so far into her that, for a while looked so far into her that, for a while the visual 
    held no memory. Little by little, I returned to myself, waking to nurse the visual held no  memory. Little by little, I returned to myself, waking to nurse
    '''
    
    stoplist = set(stopwords.words('english') + list(punctuation))
    
    tokens = [token for token in word_tokenize(text) if token.lower() not in stoplist]
    
    FreqDist(ngrams(tokens, 2)).most_common()
    

    [out]:

    [(('looked', 'far'), 4),
     (('far', 'looked'), 3),
     (('visual', 'held'), 2),
     (('held', 'memory'), 2),
     (('memory', 'Little'), 2),
     (('Little', 'little'), 2),
     (('little', 'returned'), 2),
     (('returned', 'waking'), 2),
     (('waking', 'nurse'), 2),
     (('pure', 'amnesia'), 1),
     (('amnesia', 'face'), 1),
     (('face', 'newborn'), 1),
     (('newborn', 'looked'), 1),
     (('far', 'visual'), 1),
     (('nurse', 'visual'), 1)]
    

    (See also: Difference between Python's collections.Counter and nltk.probability.FreqDist)
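
To illustrate that comparison: as far as I know, FreqDist in NLTK 3 is built on top of collections.Counter, so the counting step can be sketched with the standard library alone. The bigrams helper below is a hypothetical stand-in for nltk.util.ngrams(tokens, 2), and the token list is made up.

```python
from collections import Counter

def bigrams(tokens):
    """Yield adjacent token pairs as tuples, like ngrams(tokens, 2) would."""
    return zip(tokens, tokens[1:])

# Made-up tokens, already stripped of stopwords and punctuation.
tokens = ['looked', 'far', 'looked', 'far', 'visual']

counts = Counter(bigrams(tokens))

# most_common(n) returns the n most frequent items, highest count first.
top = counts.most_common(1)
print(top)
```

Whether to use Counter or FreqDist is mostly a question of whether you need the extra NLTK methods; for plain counting and sorting, the two behave the same way here.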

    Finally, when printing it out to a file, you should really use a context manager, see http://eigenhombre.com/introduction-to-context-managers-in-python.html

    with open('bigrams-list.tsv', 'w') as fout:
        for bg, count in FreqDist(ngrams(tokens, 2)).most_common():
            print('\t'.join([' '.join(bg), str(count)]), end='\n', file=fout)
    

    [bigrams-list.tsv]:

    looked far  4
    far looked  3
    visual held 2
    held memory 2
    memory Little   2
    Little little   2
    little returned 2
    returned waking 2
    waking nurse    2
    pure amnesia    1
    amnesia face    1
    face newborn    1
    newborn looked  1
    far visual  1
    nurse visual    1
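
Since the question asks for a CSV file rather than a TSV, here is a minimal sketch using the standard csv module instead of print(). The rows list and the write_ngram_csv helper are made up for illustration; in the real script the rows would come from FreqDist(...).most_common(). The header row is also an assumption and can be dropped.

```python
import csv
import io

# Hypothetical (ngram, count) pairs, shaped like FreqDist(...).most_common().
rows = [(('looked', 'far'), 4), (('far', 'looked'), 3)]

def write_ngram_csv(rows, fout):
    """Write one CSV line per n-gram: the space-joined words, then the count."""
    writer = csv.writer(fout)
    writer.writerow(['ngram', 'count'])  # optional header row
    for ngram, count in rows:
        writer.writerow([' '.join(ngram), count])

# An in-memory buffer stands in for a real file so the sketch is self-contained.
buffer = io.StringIO()
write_ngram_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With a real file, open it as `with open('bigrams-list.csv', 'w', newline='') as fout:` and pass fout to the helper; the csv documentation recommends newline='' so the writer controls the line endings itself.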
    

    Food for thought

    Now you see this strange bigram, Little little. Does it make sense?

    It's a by-product of removing by from

    Little by little

    So, depending on the ultimate task for the ngrams you've extracted, you might not really want to remove stopwords from the token list.
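
The effect is easy to reproduce on a small scale. The sketch below uses a tiny made-up stoplist and plain str.split() instead of NLTK, purely so that it runs without any downloads; the bigrams helper is again a hypothetical stand-in for ngrams(tokens, 2).

```python
# Tiny hypothetical stoplist; the real one comes from
# nltk.corpus.stopwords.words('english').
STOPLIST = {'by', 'i', 'to'}

def bigrams(tokens):
    """Return adjacent token pairs as a list of tuples."""
    return list(zip(tokens, tokens[1:]))

tokens = 'Little by little I returned'.split()

# Lowercase only for the membership test, as in the answer above.
filtered = [t for t in tokens if t.lower() not in STOPLIST]

with_stopwords = bigrams(tokens)
without_stopwords = bigrams(filtered)

print(with_stopwords)
print(without_stopwords)
```

Only the filtered version contains ('Little', 'little'), a pair of words that were never adjacent in the original text.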