Tags: python, string, mapreduce, nlp, special-characters

Stripping punctuation from text in Python


I am trying to get the tokens (words) from a text file and strip all punctuation characters from them. I am trying the following:

import re 

with open('hw.txt') as f:
    lines_after_254 = f.readlines()[254:]
    sent = [word for line in lines_after_254 for word in line.lower().split()]
    words = re.sub('[!#?,.:";]', '', sent)

I am getting the following error:

return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

Solution

  • A couple of things here in your script. `re.sub` expects a string as its third argument, but `sent` is a list of words, which is exactly why you get the `TypeError`. Also, you are trying to remove the special characters only after splitting the text into tokens.

    A better way would be to read the whole input as one string, remove the special characters, and then tokenize it.

    import re
    
    # open the input text file and read its contents
    with open('hw.txt') as f:
        text = f.read()
    print(text)
    
    # remove the special characters from the text
    no_specials_string = re.sub('[!#?,.:";]', '', text)
    print(no_specials_string)
    
    # split the text and store words in a list
    words = no_specials_string.split()
    print(words)
    

    Alternatively, if you want to split into tokens first and then remove special characters, you can do this:

    import re
    
    # open the input text file and read its contents
    with open('hw.txt') as f:
        text = f.read()
    print(text)
    
    # split the text and store words in a list
    words = text.split()
    print(words)
    
    # remove special characters from each word in words
    new_words = [re.sub('[!#?,.:";]', '', word) for word in words]
    print(new_words)
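
If you want to strip every ASCII punctuation character rather than just the handful listed in the character class above, `str.translate` with `string.punctuation` is a common alternative. A minimal sketch (the sample `text` here is a hypothetical stand-in for the contents of `hw.txt`):

```python
import string

# sample text standing in for the file contents (hypothetical)
text = 'Hello, world! Is this: a "test"?'

# build a translation table mapping every ASCII punctuation character to None
table = str.maketrans('', '', string.punctuation)

# remove punctuation first, then lowercase and tokenize
words = text.translate(table).lower().split()
print(words)  # → ['hello', 'world', 'is', 'this', 'a', 'test']
```

This avoids maintaining a hand-written character class, though note that `string.punctuation` covers ASCII only, so curly quotes and other Unicode punctuation would pass through untouched.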