Finding the longest word in a .txt file without punctuation marks


I am doing Python file I/O exercises. I have made good progress on one in which I try to find the longest word in each line of a .txt file, but I can't get rid of the punctuation marks.

Here is the code I have:

with open("original-3.txt", 'r') as file1:
    lines = file1.readlines()
for line in lines:
    if not line == "\n":
        print(max(line.split(), key=len))

This is the output I get

This is the original-3.txt file I am reading the data from:

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"

He took his vorpal sword in hand:
Long time the manxome foe he sought,
So rested he by the Tumtum tree,
And stood a while in thought.

And, as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!

One two! One two! And through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.

"And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!"
"Oh frabjous day! Callooh! Callay!"
He chortled in his joy.

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

As you can see, I am getting the punctuation marks like ["," ";" "?" "!"]

How can I get only the words themselves?

Thank you


Solution

  • You have to strip those characters from the words:

    with open("original-3.txt", 'r') as file1:
        lines = file1.readlines()
    for line in lines:
        if not line == "\n":
            print(max((word.strip(",?;!\"") for word in line.split()), key=len))
    

    Alternatively, use a regular expression to extract everything that looks like a word (here \w+, i.e. a run of letters, digits, or underscores):

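    import re

    with open("original-3.txt", 'r') as file1:
        lines = file1.readlines()
    for line in lines:
        words = re.findall(r"\w+", line)  # runs of word characters; punctuation is dropped
        if words:  # blank lines yield no words, so skip them
            print(max(words, key=len))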
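
The two approaches above differ on words with internal punctuation: \w+ splits "snicker-snack" into "snicker" and "snack", while stripping only trims the ends of each word. As a minimal sketch of the stripping approach that also covers ".", ":" and "'" (all of which occur in the file), string.punctuation from the standard library can replace the hand-written character list. The helper name longest_word and the sample lines are illustrative assumptions, not part of the original answer.

```python
import string


def longest_word(line):
    """Return the longest whitespace-separated word in line, with leading and
    trailing punctuation removed, or None if the line contains no words.

    Hypothetical helper for illustration only.
    """
    # string.punctuation is the set of ASCII punctuation characters.
    words = [word.strip(string.punctuation) for word in line.split()]
    # Drop tokens that consisted only of punctuation.
    words = [word for word in words if word]
    return max(words, key=len) if words else None


sample = [
    "'Twas brillig, and the slithy toves",
    "The vorpal blade went snicker-snack!",
    "",
]
for line in sample:
    print(longest_word(line))
```

For the three sample lines this prints brillig, snicker-snack and None. When several words tie for the longest, max returns the first one it encounters.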