I am writing a function that reads data from a text file, which is set up as one sentence per line. There are 6 cleanup steps I need to apply to the file to make it usable later on in my program:
1. Make everything lowercase
2. Split the line into words
3. Remove all punctuation, such as “,”, “.”, “!”, etc.
4. Remove apostrophes and hyphens, e.g. transform “can’t” into “cant” and
“first-born” into “firstborn”
5. Remove the words that are not all alphabetic characters (do not remove
“can’t” because you have transformed it to “cant”, similarly for
“firstborn”).
6. Remove the words with fewer than 2 characters, like “a”.
Here's what I have so far...
def read_data(fp):
    file_dict={}
    fp=fp.lower
    fp=fp.strip(string.punctuation)
    lines=fp.readlines()
I am a little stuck, so how do I strip this file of these 6 items?
This can be done with two regex substitutions, followed by a list comprehension that drops every word with fewer than 2 characters:
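import re

with open("text.txt", "r") as fi:
    # Lowercase, then drop punctuation and "_" (which \w also matches);
    # whitespace is kept so that line breaks still separate words
    lowerFile = re.sub(r"[^\w\s]|_", "", fi.read().lower())
# Drop any word that still contains a character other than a-z
lowerFile = re.sub(r"(^|\s)\S*[^a-z\s]\S*(?=$|\s)", "", lowerFile)
# Keep only words with at least 2 characters
words = [word for word in lowerFile.split() if len(word) >= 2]
print(words)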
For example, if text.txt contains:
I li6ke to swim, dance, and Run r8un88.
the output is:
['to', 'swim', 'dance', 'and', 'run']
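If you would rather keep the read_data(fp) shape from the question and process the file one sentence at a time, here is a minimal sketch without regex. The helper clean_line, the PUNCTUATION constant, and the choice to return one list of words per line are my own assumptions, not something the question specifies; adapt the return value to whatever file_dict was meant to hold.

```python
import string

# Assumption: ASCII punctuation plus the curly quotes that appear in typeset
# text such as “can’t”. Extend this string if your file uses other symbols.
PUNCTUATION = string.punctuation + "\u2018\u2019\u201c\u201d"
_STRIP_TABLE = str.maketrans("", "", PUNCTUATION)


def clean_line(line):
    """Apply the six steps to one sentence and return its list of words."""
    words = line.lower().split()                        # steps 1 and 2
    words = [w.translate(_STRIP_TABLE) for w in words]  # steps 3 and 4
    # Steps 5 and 6: keep purely alphabetic words of at least 2 characters
    return [w for w in words if w.isalpha() and len(w) >= 2]


def read_data(fp):
    """Return one list of cleaned words per line of the open file fp."""
    return [clean_line(line) for line in fp]
```

You would call it as read_data(fi) inside a with open("text.txt") as fi: block. Note that str.isalpha() also accepts non-ASCII letters such as "é", whereas the regex version above only keeps a-z.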