Tags: python, string, debugging, replace, quotes

Removing quotes and double quotes from a list of words


This is my first question on this site, so please forgive any formatting or language errors. The question is based on the book "Think Python" by Allen Downey. The exercise is to write a Python program that reads a book in text format and removes all whitespace (such as spaces and tabs), punctuation, and other symbols. I tried many different ways to remove the punctuation, but the single and double quotes are never removed; they persistently stay. Below is the last code I tried.

import string

def del_punctuation(item):
    '''
        This function deletes punctuation from a word.
    '''
    punctuation = string.punctuation
    for c in item:
        if c in punctuation:
            item = item.replace(c, '')
    return item

def break_into_words(filename):
    '''
        This function reads file, breaks it into 
        a list of used words in lower case.
    '''
    book = open(filename)
    words_list = []
    for line in book:
        for item in line.split():
            item = del_punctuation(item)
            item=item.lower()
            #print(item)
            words_list.append(item)
    return words_list

print(break_into_words('input.txt'))

I have not included the code that removes the whitespace, as it works perfectly; I have only included the code for removing punctuation. All punctuation characters are removed except for the single and double quotes. Please help me find the bug in the code. Or is it something to do with my IDE or compiler? Thanks in advance.

input.txt:

“Why, my dear, you must know, Mrs. Long says that Netherfield is
taken by a young man of large fortune from the north of England;
that he came down on Monday in a chaise and four to see the
place, and was so much delighted with it that he agreed with Mr.
Morris immediately; that he is to take possession before
Michaelmas, and some of his servants are to be in the house by
the end of next week.”

“What is his name?”

“Bingley.”

“Is he married or single?”

“Oh! single, my dear, to be sure! A single man of large fortune;
four or five thousand a year. What a fine thing for our girls!”

“How so? how can it affect them?”

“My dear Mr. Bennet,” replied his wife, “how can you be so
tiresome! You must know that I am thinking of his marrying one of
them.”

“Is that his design in settling here?”

The output I get is copied below:

['“why', 'my', 'dear', 'you', 'must', 'know', 'mrs', 'long', 'says', 'that', 'netherfield', 'is', 'taken', 'by', 'a', 'young', 'man', 'of', 'large', 'fortune', 'from', 'the', 'north', 'of', 'england', 'that', 'he', 'came', 'down', 'on', 'monday', 'in', 'a', 'chaise', 'and', 'four', 'to', 'see', 'the', 'place', 'and', 'was', 'so', 'much', 'delighted', 'with', 'it', 'that', 'he', 'agreed', 'with', 'mr', 'morris', 'immediately', 'that', 'he', 'is', 'to', 'take', 'possession', 'before', 'michaelmas', 'and', 'some', 'of', 'his', 'servants', 'are', 'to', 'be', 'in', 'the', 'house', 'by', 'the', 'end', 'of', 'next', 'week”', '“what', 'is', 'his', 'name”', '“bingley”', '“is', 'he', 'married', 'or', 'single”', '“oh', 'single', 'my', 'dear', 'to', 'be', 'sure', 'a', 'single', 'man', 'of', 'large', 'fortune', 'four', 'or', 'five', 'thousand', 'a', 'year', 'what', 'a', 'fine', 'thing', 'for', 'our', 'girls”', '“how', 'so', 'how', 'can', 'it', 'affect', 'them”', '“my', 'dear', 'mr', 'bennet”', 'replied', 'his', 'wife', '“how', 'can', 'you', 'be', 'so', 'tiresome', 'you', 'must', 'know', 'that', 'i', 'am', 'thinking', 'of', 'his', 'marrying', 'one', 'of', 'them”', '“is', 'that', 'his', 'design', 'in', 'settling', 'here”']

It has removed all the punctuation except for the double quotes and single quotes (I think there are single quotes in the input as well). Thanks.


Solution

  • Real texts may contain many tricky symbols: en dashes, em dashes, and more than ten different quote characters (" ' ` ‘ ’ “ ” « » ‹ ›), et cetera.

    It makes little sense to try to enumerate every possible punctuation symbol. A common approach is to keep only the letters (and spaces). The easiest way to do that is a regular expression:

    import re
    
    text = '''“Why, my dear, you must know, Mrs. Long says that Netherfield is
    taken by a young man of large fortune from the north of England;
    that he came down on Monday in a chaise and four to see the
    place, and was so much delighted with it that he agreed with Mr.
    Morris immediately; that he is to take possession before
    Michaelmas, and some of his servants are to be in the house by
    the end of next week.”
    
    “What is his name?”
    
    “Bingley.”
    
    “Is he married or single?”
    
    “Oh! single, my dear, to be sure! A single man of large fortune;
    four or five thousand a year. What a fine thing for our girls!”
    
    “How so? how can it affect them?”
    
    “My dear Mr. Bennet,” replied his wife, “how can you be so
    tiresome! You must know that I am thinking of his marrying one of
    them.”
    
    “Is that his design in settling here?”'''
    
    # remove everything except ASCII letters, spaces, \n and, for example, hyphens
    # (A-Za-z rather than A-z, because A-z would also match [ \ ] ^ _ and `)
    text = re.sub(r"[^A-Za-z \n-]", "", text)
    
    # split the text by spaces and \n
    output = text.split()
    
    print(output)
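
As a side note on why the code in the question leaves the quotes in place: string.punctuation contains only ASCII characters, and the curly quotes “ and ” in input.txt are not among them. If you prefer to remove characters rather than whitelist letters, one possible sketch is to test each character's Unicode category instead. The helper name strip_punctuation is made up for this illustration, and treating symbol categories (S*) as removable alongside punctuation (P*) is an assumption, not something the exercise requires:

```python
import string
import unicodedata

def strip_punctuation(word):
    """Drop characters whose Unicode category is punctuation (P*) or symbol (S*)."""
    return "".join(
        ch for ch in word
        if not unicodedata.category(ch).startswith(("P", "S"))
    )

# The curly opening quote is not part of the ASCII-only string.punctuation,
# so a check like `c in string.punctuation` never matches it.
print("“" in string.punctuation)   # False

# Unicode classifies “ and ” as punctuation, so they are removed here.
print(strip_punctuation("“Bingley.”").lower())   # bingley
```

Unlike the regular expression above, this keeps accented and non-Latin letters, which may or may not be what you want.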
    

    But the matter is actually more complicated than it looks at first glance. Is I'm two words? Probably so. What about someone's? Or rock'n'roll?
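
There is no single right answer to those questions, but here is a rough sketch of one possible policy: treat an apostrophe as part of a word only when it has letters on both sides. The names WORD_RE and tokenize are made up for this example, and the pattern assumes ASCII letters only, so hyphenated and accented words would need extra handling:

```python
import re

# One or more letters, optionally followed by groups of
# (apostrophe + letters). Both ' and ’ are accepted as apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*")

def tokenize(text):
    """Return the lower-cased words found in text, keeping in-word apostrophes."""
    return [word.lower() for word in WORD_RE.findall(text)]

print(tokenize("“I'm sure someone's playing rock'n'roll,” he said."))
# ["i'm", 'sure', "someone's", 'playing', "rock'n'roll", 'he', 'said']
```

Quotes that wrap a word, as in ‘single’, are dropped because they are not followed or preceded by a letter on both sides.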