Tags: python, string, mapreduce, nlp, special-characters

Stripping punctuation from text in Python


I am trying to get the tokens (words) from a text file and strip all punctuation characters from them. I am trying the following:

import re 

with open('hw.txt') as f:
    lines_after_254 = f.readlines()[254:]
    sent = [word for line in lines_after_254 for word in line.lower().split()]
    words = re.sub('[!#?,.:";]', '', sent)

I am getting the following error:

return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

Solution

  • A couple of things here in your script. `re.sub` expects a string as its third argument, but `sent` is a list of words, which is exactly why you get the `TypeError`. Also, you are trying to remove the special characters only after splitting the text into tokens.

    A better way would be to read the whole input as one string, remove the special characters, and then tokenize it.

    import re
    
    # open the input text file and read its contents
    with open('hw.txt') as f:
        text = f.read()
    print(text)
    
    # remove the special characters from the text
    no_specials_string = re.sub('[!#?,.:";]', '', text)
    print(no_specials_string)
    
    # split the text and store words in a list
    words = no_specials_string.split()
    print(words)
    

    Alternatively, if you want to split into tokens first and then remove special characters, you can do this:

    import re
    
    # open the input text file and read its contents
    with open('hw.txt') as f:
        text = f.read()
    print(text)
    
    # split the text and store words in a list
    words = text.split()
    print(words)
    
    # remove special characters from each word in words
    new_words = [re.sub('[!#?,.:";]', '', word) for word in words]
    print(new_words)
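
If you want to strip every ASCII punctuation character rather than just the handful listed in the character class above, `str.translate` with `string.punctuation` is a common alternative. A minimal sketch (the sample `text` here is a hypothetical stand-in for the contents of `hw.txt`):

```python
import string

# sample text standing in for the file contents (hypothetical)
text = 'Hello, world! Is this: a "test"?'

# build a translation table mapping every ASCII punctuation character to None
table = str.maketrans('', '', string.punctuation)

# remove punctuation first, then lowercase and tokenize
words = text.translate(table).lower().split()
print(words)  # → ['hello', 'world', 'is', 'this', 'a', 'test']
```

This avoids maintaining a hand-written character class, though note that `string.punctuation` covers ASCII only, so curly quotes and other Unicode punctuation would pass through untouched.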