
How to extract text between certain patterns using a regular expression (RegEx)?


My text:

27/07/18, 12:02 PM - user_a: https://www.youtube.com/
 Watch this
27/07/18, 12:15 PM - user_b: <Media omitted>
27/07/18, 12:52 PM - user_b: Read this fully
some text
some text
.
some text
27/07/18, 12:56 PM - user_c: text ..

Here I want to extract the messages sent by each user. I tried two regexes, but neither gave the result I wanted.

First regex:

re.findall(r''+user_name+ ':(.*)', data)

This does not capture messages that span multiple lines.

Second regex:

re.findall(r''+ user_name + ':[^(:)]*', data)

This does not capture the full text when a message contains a hyperlink: I only get "https", because the character class [^(:)] stops at the ":" symbol and treats it as the endpoint.

How do I handle this? Any suggestions would be very helpful.


Solution

  • You may use the following pattern:

    user_b: (.*?)(?=^[0-9]{2}/[0-9]{2}/[0-9]{2})
    


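    Note the usage of re.MULTILINE and re.DOTALL. re.MULTILINE makes ^ match at the start of every line rather than only at the start of the whole string, which the lookahead needs in order to find the next timestamp. re.DOTALL lets . match newlines too, so a captured message can span several lines.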
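
To see why both flags matter, here is a minimal sketch that runs the same pattern with no flags, with re.MULTILINE only, and with both. The sample string and the variable names are made up for illustration and are not part of the original chat export.

```python
import re

# Hypothetical two-line message followed by the next timestamped line.
sample = "user_b: first line\nsecond line\n27/07/18, 12:56 PM - user_c: text\n"
pattern = r"user_b: (.*?)(?=^[0-9]{2}/[0-9]{2}/[0-9]{2})"

# Without flags, ^ only matches at the very start of the string.
no_flags = re.findall(pattern, sample)

# With re.MULTILINE alone, ^ matches at line starts, but . still stops at "\n".
multiline_only = re.findall(pattern, sample, re.MULTILINE)

# With both flags, the lazy .*? can cross newlines up to the next timestamp.
both_flags = re.findall(pattern, sample, re.MULTILINE | re.DOTALL)

print(no_flags)        # []
print(multiline_only)  # []
print(both_flags)      # ['first line\nsecond line\n']
```

Only the combination returns the multi-line message, which matches the behaviour described in the note above.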


    In Python:

    import re
    data = '''
    27/07/18, 12:02 PM - user_a: https://www.youtube.com/
     Watch this
    27/07/18, 12:15 PM - user_b: <Media omitted>
    27/07/18, 12:52 PM - user_b: Read this fully
    some text
    some text
    .
    some text
    27/07/18, 12:56 PM - user_c: text ..
    '''
    usern = 'user_b'
    
    # re.escape guards against regex metacharacters in the user name
    pattern = re.compile(re.escape(usern) + r": (.*?)(?=^[0-9]{2}/[0-9]{2}/[0-9]{2})", re.DOTALL | re.MULTILINE)
    print(pattern.findall(data))
    

    Prints:

    ['<Media omitted>\n', 'Read this fully\nsome text\nsome text\n.\nsome text\n']
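
One edge case worth noting: the lookahead requires a following timestamp, so the very last message in the text (user_c in the sample) is not matched when nothing follows it. A possible workaround, sketched here under the assumption that timestamps always look like dd/dd/dd at the start of a line, is to let the lookahead also accept the end of the string with \Z. The helper name extract_messages and the shortened chat string are hypothetical.

```python
import re

TIMESTAMP = r"[0-9]{2}/[0-9]{2}/[0-9]{2}"

def extract_messages(data, user_name):
    """Return every message body for user_name, including the final one."""
    pattern = re.compile(
        # Stop at the next line that starts with a timestamp, or at end of string.
        re.escape(user_name) + r": (.*?)(?=^" + TIMESTAMP + r"|\Z)",
        re.DOTALL | re.MULTILINE,
    )
    return pattern.findall(data)

chat = (
    "27/07/18, 12:02 PM - user_a: https://www.youtube.com/\n"
    " Watch this\n"
    "27/07/18, 12:56 PM - user_c: text ..\n"
)

print(extract_messages(chat, "user_a"))  # ['https://www.youtube.com/\n Watch this\n']
print(extract_messages(chat, "user_c"))  # ['text ..\n']
```

The captured text keeps its trailing newline; calling .strip() on each result removes it if that is not wanted.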