I am new to Python, so apologies for the simple question. My task is the following:
Create a list of alphabetically sorted unique words and display the first 5 words
I have a text variable, which contains a lot of text information.
I did
test = text.split()
sorted(test)
As a result, I get a list that starts with symbols like $ and numbers.
How do I get only the words and print the first N of them?
I'm assuming that by "word" you mean strings that consist only of alphabetic characters. In that case, you can use the built-in filter function to first get rid of the unwanted strings, turn the result into a set to drop duplicates, sort it, and then print the first few items.
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $@"
# Keep only the strings that consist entirely of letters
words = filter(lambda x: x.isalpha(), text.split(' '))
# Print the first 5 unique words in sorted order
print(sorted(set(words))[:5])
Output-
['atop', 'king', 'mountain', 'of', 'peak']
But the problem with this is that it will still ignore words like mountain's because of that pesky '. A regex solution might be better in such a case.
For now, we'll go with this regex: ^[A-Za-z']+$, which means the string must contain only letters and '. You may extend this regex according to what you consider "words". Read more on regexes here.
We'll be using the compiled pattern's .match method instead of .isalpha this time.
import re
WORD_PATTERN = re.compile(r"^[A-Za-z']+$")
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $@"
# Keep only the strings made up of letters and apostrophes
words = filter(lambda x: bool(WORD_PATTERN.match(x)), text.split(' '))
# Print the first 5 unique words in sorted order
print(sorted(set(words))[:5])
Output-
['atop', 'king', 'mountain', "mountain's", 'of']
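Since the question asks for the first N words rather than exactly 5, the steps above could be wrapped in a small helper. This is only a sketch: the function name first_n_words and its n parameter are hypothetical, and it keeps the same definition of "word" (letters and apostrophes only).

```python
import re

# Same pattern as above: letters and apostrophes only.
WORD_PATTERN = re.compile(r"^[A-Za-z']+$")


def first_n_words(text, n=5):
    """Return the first n unique words of text in sorted order.

    Hypothetical helper; the name and the default n=5 are illustrative.
    """
    # split() with no argument splits on any run of whitespace
    candidates = text.split()
    # A set comprehension filters and removes duplicates in one step
    words = {w for w in candidates if WORD_PATTERN.match(w)}
    return sorted(words)[:n]


text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $@"
print(first_n_words(text, 3))  # ['atop', 'king', 'mountain']
```

Note that sorted compares strings case-sensitively, so capitalized words would come before lowercase ones; passing key=str.lower to sorted is one way to change that.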
Keep in mind, however, that this gets tricky when you have a string like hi! What's your name?. Here hi! and name? are words, except they are not fully alphabetic, so the filter discards them. The trick is to split the text in such a way that you get hi instead of hi! and name instead of name? in the first place.
Unfortunately, a true word split is far outside the scope of this question. I suggest taking a look at this question.
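As a rough approximation only (not a true word split), one option is to pull the words out of the text with re.findall instead of splitting on spaces and filtering afterwards. The sketch below assumes a "word" is a run of ASCII letters with optional internal apostrophes; the name TOKEN_PATTERN is made up for this example.

```python
import re

# Assumption: a word is a run of ASCII letters, optionally joined by
# internal apostrophes (so "What's" stays whole, but "!" and "?" are dropped).
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

text = "hi! What's your name?"

# findall returns every non-overlapping match, in order of appearance
tokens = TOKEN_PATTERN.findall(text)
print(tokens)  # ['hi', "What's", 'your', 'name']

# key=str.lower makes the sort case-insensitive
unique_sorted = sorted(set(tokens), key=str.lower)
print(unique_sorted[:5])  # ['hi', 'name', "What's", 'your']
```

This will still mishandle hyphenated words, accented letters, and numbers mixed into words, so treat it as a starting point and adjust the pattern to your data.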