regex python-2.7 regex-greedy non-greedy

Python non-greedy regular expression matches too much data

String: '<td attr="0">str1</td><td attr="5">str2</td><td attr="7">str2</td><td attr="9">str4</td>'

I want to search for and get only the first "td" tag that contains the text "str2", so I tried two different non-greedy expressions:

>>> import re
>>> mystring = '<td attr="0">str1</td><td attr="5">str2</td><td attr="7">str2</td><td attr="9">str4</td>'
>>> print re.search("(<td.*?str2.*?</td>)",mystring).group(1)
<td attr="0">str1</td><td attr="5">str2</td>
>>> print re.search(".*(<td.*?str2.*?</td>).*",mystring).group(1)
<td attr="7">str2</td>

I expected the output to be '<td attr="5">str2</td>' because I used non-greedy quantifiers in the regular expression. What is wrong here, and how can I get the expected result?

Note: I cannot use an HTML parser because my actual data set is not well-formed enough for XML parsing.

Solution

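A non-greedy quantifier only keeps the match as short as possible from the point where it starts. The search still begins at the leftmost "<td", and .*? is free to run across later tags until it reaches "str2". In your second pattern the leading .* is greedy, so it pushes the group to the last cell that can still match. Use [^>] instead of . so the match cannot leave the opening tag: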

>>> print re.search("(<td[^>]*?>str2.*?</td>)",mystring).group(1)
<td attr="5">str2</td>
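
To see the difference side by side, here is a minimal sketch that reuses mystring from the question. The names loose and tight are only illustrative, and the code is written so that it should run on both Python 2.7 and Python 3:

```python
import re

mystring = ('<td attr="0">str1</td><td attr="5">str2</td>'
            '<td attr="7">str2</td><td attr="9">str4</td>')

# ".*?" may cross tag boundaries, so the match starts at the very first "<td"
# and simply stops as early as it can after that.
loose = re.search(r"(<td.*?str2.*?</td>)", mystring).group(1)

# "[^>]*?" cannot get past the ">" that closes the opening tag, so only a
# cell whose text begins with "str2" can match.
tight = re.search(r"(<td[^>]*?>str2.*?</td>)", mystring).group(1)

print(loose)  # <td attr="0">str1</td><td attr="5">str2</td>
print(tight)  # <td attr="5">str2</td>
```

Note that the tighter pattern requires "str2" to follow the opening tag immediately, which is fine for the sample string but may be too strict for other data.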


Or, better, use HTMLParser.
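
If a parser turns out to be an option after all, a rough sketch with the standard library's HTMLParser could look like the following. The class FirstCellFinder and its attributes are hypothetical names made up for this example, and it assumes you want the first cell whose text is exactly "str2". Whether the parser copes with your real, less tidy data is something you would have to test:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7


class FirstCellFinder(HTMLParser):
    """Hypothetical helper: remembers the first <td> whose text equals target."""

    def __init__(self, target):
        HTMLParser.__init__(self)
        self.target = target
        self.found = None    # (attrs, text) of the first matching cell
        self._attrs = None   # attributes of the <td> currently open, if any
        self._text = []      # text chunks collected inside that <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._attrs = dict(attrs)
            self._text = []

    def handle_data(self, data):
        if self._attrs is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._attrs is not None:
            text = "".join(self._text)
            if self.found is None and text == self.target:
                self.found = (self._attrs, text)
            self._attrs = None


mystring = ('<td attr="0">str1</td><td attr="5">str2</td>'
            '<td attr="7">str2</td><td attr="9">str4</td>')

finder = FirstCellFinder("str2")
finder.feed(mystring)
finder.close()
print(finder.found)  # ({'attr': '5'}, 'str2')
```

This returns the attributes and text of the cell rather than the raw markup, which is usually what you need next anyway.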

EDIT: This regex also matches when "str2" is nested inside sub-tags of the cell:

(<td[^<]*?(?:<(?!td)[^<]*?)*str2.*?</td>)
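
As a quick check of that pattern, here is a small sketch on a made-up input in which the first "str2" sits inside a <b> tag. The string nested and the variable names are assumptions for illustration, not data from the question:

```python
import re

# Made-up sample: the first "str2" is wrapped in a sub-tag.
nested = ('<td attr="0">str1</td><td attr="5"><b>str2</b></td>'
          '<td attr="7">str2</td>')

# The simpler pattern needs "str2" right after the opening tag, so it
# skips the nested cell and lands on the later plain one.
simple = re.search(r"(<td[^>]*?>str2.*?</td>)", nested).group(1)

# The extended pattern may step over any "<" that does not start another
# "<td", so it can reach "str2" inside the sub-tag of the same cell.
extended = re.search(r"(<td[^<]*?(?:<(?!td)[^<]*?)*str2.*?</td>)", nested).group(1)

print(simple)    # <td attr="7">str2</td>
print(extended)  # <td attr="5"><b>str2</b></td>
```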