I have a string, produced by a machine learning algorithm, that generally spans multiple lines. At the beginning and at the end there can be some lines containing no characters (except for whitespace), and in between there should be 2 lines, each containing a word followed by some numbers and (sometimes) other characters.
Something like this:
first_word 3 5 7 @ 4
second_word 4 5 67| 5 [
I need to extract the 2 words and the numeric characters.
I can eliminate the empty lines by doing something like:
lines_list = initial_string.split("\n")
for line in lines_list:
    if len(line) > 0 and not line.isspace():
        print(line)
but now I was wondering:
I imagine regular expressions could be useful, but I have never really used them, so I'm struggling a bit at the moment.
I would use re.findall here:
import re
inp = '''first_word 3 5 7 @ 4
second_word 4 5 67| 5 ['''
matches = re.findall(r'\w+', inp)
print(matches) # ['first_word', '3', '5', '7', '4', 'second_word', '4', '5', '67', '5']
If you want to process each line separately, simply split the input on line breaks and use the same approach:
import re
inp = '''first_word 3 5 7 @ 4
second_word 4 5 67| 5 ['''
lines = inp.split('\n')
for line in lines:
    # \w+ matches runs of letters, digits and underscores
    matches = re.findall(r'\w+', line)
    print(matches)
This prints:
['first_word', '3', '5', '7', '4']
['second_word', '4', '5', '67', '5']
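If the input may also contain the blank or whitespace-only lines mentioned in the question, the filtering and the extraction can probably be combined in one pass. The following is only a sketch: the function name `parse_lines` is made up for illustration, and it assumes the first token on each non-blank line is the word and that only purely numeric tokens after it are wanted.

```python
import re

def parse_lines(text):
    """Return one (word, numbers) tuple per line that contains any word characters.

    Hypothetical helper: assumes the first token on a line is the word
    and every later all-digit token is a number to keep.
    """
    results = []
    for line in text.splitlines():
        tokens = re.findall(r'\w+', line)
        if not tokens:
            # Blank, whitespace-only, or symbol-only line: nothing to extract
            continue
        word = tokens[0]
        # Keep only all-digit tokens, converted to int
        numbers = [int(tok) for tok in tokens[1:] if tok.isdecimal()]
        results.append((word, numbers))
    return results

inp = "\n   \nfirst_word 3 5 7 @ 4\nsecond_word 4 5 67| 5 [\n\n"
print(parse_lines(inp))
# [('first_word', [3, 5, 7, 4]), ('second_word', [4, 5, 67, 5])]
```

Using `splitlines()` instead of `split('\n')` also copes with `\r\n` line endings, and because empty lines produce no matches they are dropped without a separate `isspace()` check.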