Tags: html, regex, python-3.x, regex-greedy

Greedy and lazy quantifiers: testing with HTML tags


The input is:

<p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p>
<p>
Then you have to find the end of the paragraph
</p>

The expected first output is (as I am using a greedy quantifier):

<p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p>
<p>
Then you have to find the end of the paragraph
</p>

The code used for the greedy match is below:

text = '''
<p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p>
<p>
Then you have to find the end of the paragraph
</p>
'''
pattern=re.compile(r'\<p\>.*\<\/p\>')
data1=pattern.match(text,re.MULTILINE)
print('data1:- ',data1,'\n')

The expected second output is (as I am using a lazy quantifier):

<p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p>

The code used for the lazy match is below:

text = '''
<p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p>
<p>
Then you have to find the end of the paragraph
</p>
'''
#pattern=re.compile(r'\<p\>.*?\<\/p\>')
pattern=re.compile(r'<p>.*?</p>')
data1=pattern.match(text,re.MULTILINE)
print('data1:- ',data1,'\n')

I am getting None in both cases as the actual output.


Solution

  • You have a few issues. First, the second and third parameters of Pattern.match are the positions pos and endpos, not flags, so re.MULTILINE is read as a start offset; flags must be passed to re.compile instead. Second, you need re.DOTALL to make . match newlines, not re.MULTILINE, which only changes how ^ and $ behave. Finally, match requires the match to begin at the start of the string (which in your case is a newline character), so it won't match; use Pattern.search instead. This will work for your sample input:

    import re
    # Flags go to re.compile; DOTALL lets . match newlines
    pattern = re.compile(r'<p>.*</p>', re.DOTALL)
    data1 = pattern.search(text)
    print('data1:- ', data1.group(0), '\n')
    

    Output:

    data1:-  <p>
    The very <em>first</em> task is to find the beginning of a paragraph.
    </p>
    <p>
    Then you have to find the end of the paragraph
    </p> 
    

    Single match (lazy quantifier):

    # .*? is lazy, so it stops at the first </p>
    pattern = re.compile(r'<p>.*?</p>', re.DOTALL)
    data1 = pattern.search(text)
    print('data1:- ', data1.group(0), '\n')
    

    Output:

    data1:-  <p>
    The very <em>first</em> task is to find the beginning of a paragraph.
    </p> 
    

    Note also that /, < and > have no special meaning in regexes and don't need to be escaped, so I've removed the backslashes in my code above.
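
To check the three points above for yourself, here is a minimal sketch that reuses the sample text from the question. The variable names (`lazy`, `greedy`, `start`, and so on) are only illustrative, and it assumes CPython's `re` module, where a flag such as `re.MULTILINE` is an integer and is therefore accepted as the `pos` argument without any error.

```python
import re

text = '''
<p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p>
<p>
Then you have to find the end of the paragraph
</p>
'''

lazy = re.compile(r'<p>.*?</p>', re.DOTALL)
greedy = re.compile(r'<p>.*</p>', re.DOTALL)

# 1. A flag passed to Pattern.match is read as the start position `pos`.
#    The call raises no error; it just starts matching at that offset.
start = int(re.MULTILINE)
flag_as_pos = lazy.match(text, re.MULTILINE)
same_call = lazy.match(text, start)

# 2. Without DOTALL, . stops at the newline right after <p>.
without_dotall = re.search(r'<p>.*</p>', text)

# 3. match() is anchored at the start position; the text begins with '\n'.
anchored = lazy.match(text)       # fails on the leading newline
skipped = lazy.match(text, 1)     # pos=1 starts at the first <p>
searched = lazy.search(text)      # search scans forward on its own

# Lazy vs. greedy across the whole input.
lazy_paragraphs = lazy.findall(text)
greedy_paragraphs = greedy.findall(text)

print('flag as pos:', flag_as_pos, same_call)
print('without DOTALL:', without_dotall)
print('match at 0:', anchored)
print('lazy count:', len(lazy_paragraphs))
print('greedy count:', len(greedy_paragraphs))
```

If you want every paragraph rather than only the first, `findall` (or `finditer`) with the lazy pattern returns one string per `<p>...</p>` pair, while the greedy pattern returns a single string spanning both. For real HTML, an HTML parser is usually a safer choice than a regex.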