My text:
27/07/18, 12:02 PM - user_a: https://www.youtube.com/
Watch this
27/07/18, 12:15 PM - user_b: <Media omitted>
27/07/18, 12:52 PM - user_b: Read this fully
some text
some text
.
some text
27/07/18, 12:56 PM - user_c: text ..
Here I want to extract the messages sent by each user. I tried two regexes, but neither gave me the result I wanted.
First regex:
re.findall(r''+user_name+ ':(.*)', data)
With this one I can't extract messages that span multiple lines.
Second regex:
re.findall(r''+ user_name + ':[^(:)]*', data)
With this one I can't extract the full text of a message that contains a hyperlink, i.e., I only get "https", because the pattern treats the ":" symbol as the end of the message.
How do I handle this? Any suggestions would be really helpful.
You may use the following pattern:
user_b: (.*?)(?=^[0-9]{2}/[0-9]{2}/[0-9]{2})
Note the use of re.MULTILINE and re.DOTALL. The first flag is needed so that ^ matches at the beginning of every line in multiline text, whereas re.DOTALL is needed so that . matches newlines too.
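To see what each flag changes, here is a small sketch on a made-up two-line string (the `text` value and the variable names are just for illustration, not taken from the chat log):

```python
import re

# Hypothetical two-line sample, only for demonstrating the flags.
text = "first line\n27/07/18 second line\n"

# Without re.MULTILINE, ^ matches only at the very start of the string.
no_multiline = re.findall(r"^[0-9]{2}/", text)
# With re.MULTILINE, ^ also matches right after each newline.
with_multiline = re.findall(r"^[0-9]{2}/", text, re.MULTILINE)

# Without re.DOTALL, . stops at a newline, so the match cannot span lines.
no_dotall = re.search(r"first.*second", text)
# With re.DOTALL, . also matches the newline.
with_dotall = re.search(r"first.*second", text, re.DOTALL)

print(no_multiline)         # []
print(with_multiline)       # ['27/']
print(no_dotall)            # None
print(with_dotall.group())  # the match runs across the line break
```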
In Python:
import re
data = '''
27/07/18, 12:02 PM - user_a: https://www.youtube.com/
Watch this
27/07/18, 12:15 PM - user_b: <Media omitted>
27/07/18, 12:52 PM - user_b: Read this fully
some text
some text
.
some text
27/07/18, 12:56 PM - user_c: text ..
'''
usern = 'user_b'
# re.escape keeps any special characters in the user name from being read as regex syntax
pattern = re.compile(re.escape(usern) + r": (.*?)(?=^[0-9]{2}/[0-9]{2}/[0-9]{2})", re.DOTALL | re.MULTILINE)
print(pattern.findall(data))
Prints:
['<Media omitted>\n', 'Read this fully\nsome text\nsome text\n.\nsome text\n']
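One thing to watch out for: the lookahead requires another timestamp after the message, so the very last message in the log (user_c's in this sample) is not matched. A possible variant, as a sketch only, adds \Z (end of string) as an alternative in the lookahead and anchors the match on the timestamp prefix. The helper name messages_from is made up for this example, and the timestamp part of the pattern assumes the exact "dd/mm/yy, h:mm AM/PM - " layout shown in the sample:

```python
import re

data = '''27/07/18, 12:02 PM - user_a: https://www.youtube.com/
Watch this
27/07/18, 12:15 PM - user_b: <Media omitted>
27/07/18, 12:52 PM - user_b: Read this fully
some text
some text
.
some text
27/07/18, 12:56 PM - user_c: text ..
'''

# Assumed timestamp layout, based only on the sample lines above.
STAMP = r"^[0-9]{2}/[0-9]{2}/[0-9]{2}, [0-9]{1,2}:[0-9]{2} [AP]M - "

def messages_from(user, data):
    """Return the messages sent by `user`, one string per message."""
    pattern = re.compile(
        STAMP + re.escape(user) + r": "
        # Stop lazily at the next timestamp line or at the end of the string.
        r"(.*?)(?=" + STAMP + r"|\Z)",
        re.DOTALL | re.MULTILINE,
    )
    # strip() drops the trailing newline left at the end of each message.
    return [m.strip() for m in pattern.findall(data)]

print(messages_from('user_a', data))  # ['https://www.youtube.com/\nWatch this']
print(messages_from('user_c', data))  # ['text ..']
```

Anchoring on the timestamp prefix also avoids a false match when some message body happens to contain the text "user_b: ".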