python string sorting alphabetical

Create a list of alphabetically sorted unique words and display the first N words in Python


I am new to Python, so I apologize for the simple question. My task is the following:

Create a list of alphabetically sorted unique words and display the first 5 words

I have a variable, text, which contains a lot of text.

I tried:

test = text.split()
sorted(test)

As a result, I get a list that starts with symbols like $ and with numbers.

How can I keep only the words and print the first N of them?


Solution

  • I'm assuming that by "word" you mean a string that consists only of alphabetic characters. In that case, you can use the built-in filter function to get rid of the unwanted strings first, turn the result into a set to remove duplicates, sort it, and then print the first few items.

    text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $@"
    # Keep only the strings that consist entirely of alphabetic characters
    words = filter(lambda x: x.isalpha(), text.split(' '))
    # Remove duplicates, sort, and print the first 5 words
    print(sorted(set(words))[:5])
    

    Output:

    ['atop', 'king', 'mountain', 'of', 'peak']
    

    The problem with this approach is that it ignores words like mountain's, because of that pesky apostrophe. A regex solution may work better in such a case.

    For now, we'll use the regex ^[A-Za-z']+$, which means the string must contain only letters and apostrophes. You can extend this regex according to what you consider a "word". Read more on regexes here.

    We'll use re.match instead of str.isalpha this time.

    import re

    WORD_PATTERN = re.compile(r"^[A-Za-z']+$")
    text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $@"
    # Keep only the strings that consist entirely of letters and apostrophes
    words = filter(lambda x: bool(WORD_PATTERN.match(x)), text.split(' '))
    # Remove duplicates, sort, and print the first 5 words
    print(sorted(set(words))[:5])
    

    Output:

    ['atop', 'king', 'mountain', "mountain's", 'of']
    

    Keep in mind, however, that this gets tricky with a string like "hi! What's your name?". Here hi! and name? are words, but they are not fully alphabetic, so the filter drops them. The trick is to split the text in such a way that you get hi instead of hi! and name instead of name? in the first place.

    Unfortunately, a true word split is far outside the scope of this question. I suggest taking a look at this question.
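
As a rough approximation of the splitting idea described above, one option is to extract word-like runs with re.findall instead of splitting on spaces and filtering afterwards. This is only a sketch under stated assumptions, not a real tokenizer: the helper name first_n_words and the pattern [A-Za-z']+ are hypothetical choices made for illustration, and lowercasing the words so that "What's" and "what's" count as one entry is an extra assumption that the question does not ask for.

```python
import re

# Assumption: a "word" is a run of ASCII letters and apostrophes.
TOKEN_PATTERN = re.compile(r"[A-Za-z']+")


def first_n_words(text, n=5):
    """Return the first n unique words of text in alphabetical order."""
    # findall pulls word-like runs out of the text, so "hi!" yields "hi".
    tokens = TOKEN_PATTERN.findall(text)
    # Strip stray leading/trailing apostrophes (e.g. quoted words), then
    # lowercase so that "What's" and "what's" count as the same word.
    words = {token.strip("'").lower() for token in tokens}
    words.discard("")  # a token made only of apostrophes becomes empty
    return sorted(words)[:n]


print(first_n_words("hi! What's your name?", 3))
# ['hi', 'name', "what's"]
```

Note that this behaves differently from the filter-based versions on tokens such as $1523-the: the filter drops the whole token, while findall still picks out the. Which behavior is right depends on what you count as a word.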