
How do I strip a txt file of multiple things?


I am creating a function that reads data from a txt file. The file is set up as one sentence per line. There are 6 things I need to do to the text to make it usable later on in my program:

 1. Make everything lowercase
 2. Split the line into words
 3. Remove all punctuation, such as “,”, “.”, “!”, etc.
 4. Remove apostrophes and hyphens, e.g. transform “can’t” into “cant” and “first-born” into “firstborn”
 5. Remove the words that are not all alphabetic characters (do not remove “can’t” because you have transformed it to “cant”, similarly for “firstborn”).
 6. Remove the words with fewer than 2 characters, like “a”.

Here's what I have so far...

def read_data(fp):
    file_dict={}
    fp=fp.lower
    fp=fp.strip(string.punctuation)
    lines=fp.readlines()

I am a little stuck, so how do I strip this file of these 6 items?


Solution

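  • This can be accomplished with two regex substitutions (one to remove punctuation, one to remove words that contain non-alphabetic characters), followed by a list comprehension that drops all words with fewer than 2 characters: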

    Code

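    import re

    with open("text.txt", "r") as fi:
        # Lowercase, then delete everything except word characters and whitespace
        # (newlines are kept so words on adjacent lines do not fuse together).
        text = re.sub(r"[^\w\s]", "", fi.read().lower())
        # Delete any word that still contains a character other than a-z.
        text = re.sub(r"(^|\s)\S*[^a-z\s]\S*(?=$|\s)", "", text)
        # Keep only words of at least 2 characters.
        words = [word for word in text.split() if len(word) >= 2]
        print(words)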
    

    Input

    I li6ke to swim, dance, and Run r8un88.
    

    Output

    ['to', 'swim', 'dance', 'and', 'run']
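
For comparison, here is a rough sketch of the same six steps without regex, shaped like the `read_data(fp)` function from the question. It is only one possible approach, not the method from the answer above. The helper name `clean_line`, the extra curly-quote and dash characters added to the removal set, and the choice to return one word list per line are all assumptions made for illustration. Note that `str.isalpha()` also accepts non-ASCII letters, which may or may not be what you want.

```python
import io
import string

# Characters to delete: ASCII punctuation plus curly quotes and dashes
# (the extra characters are an assumption; adjust to your data).
DROP = string.punctuation + "\u2018\u2019\u201c\u201d\u2013\u2014"
TABLE = str.maketrans("", "", DROP)

def clean_line(line):
    """Apply the six steps to one line and return the surviving words."""
    words = []
    for word in line.lower().split():          # steps 1 and 2
        word = word.translate(TABLE)           # steps 3 and 4
        if word.isalpha() and len(word) >= 2:  # steps 5 and 6
            words.append(word)
    return words

def read_data(fp):
    """Return one list of cleaned words per line of the open file fp."""
    return [clean_line(line) for line in fp]

# Usage with an in-memory file standing in for the real txt file.
sample = io.StringIO(
    "I li6ke to swim, dance, and Run r8un88.\n"
    "Can\u2019t stop the first-born!\n"
)
print(read_data(sample))
# [['to', 'swim', 'dance', 'and', 'run'], ['cant', 'stop', 'the', 'firstborn']]
```

Processing line by line keeps the sentences separate, which may be convenient if later code needs to know which words came from which line. With a real file, the call would look like `with open("text.txt") as fp: data = read_data(fp)`.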