String: '<td attr="0">str1</td><td attr="5">str2</td><td attr="7">str2</td><td attr="9">str4</td>'
I want to find only the first "td" tag that contains the text "str2", so I tried two different non-greedy expressions:
>>> mystring = '<td attr="0">str1</td><td attr="5">str2</td><td attr="7">str2</td><td attr="9">str4</td>'
>>> print re.search("(<td.*?str2.*?</td>)",mystring).group(1)
<td attr="0">str1</td><td attr="5">str2</td>
>>> print re.search(".*(<td.*?str2.*?</td>).*",mystring).group(1)
<td attr="7">str2</td>
I was expecting the output to be "<td attr="5">str2</td>", because I used non-greedy quantifiers in the regular expression. What is wrong here, and how can I get the expected result?
Note: I cannot use an HTML parser, because my actual data set is not well-formed enough for XML parsing.
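A non-greedy quantifier only keeps the match as short as possible from the position where it starts; the engine still begins at the leftmost position that can match at all, and . is free to run across tag boundaries. In your second attempt, the leading greedy .* pushes the start as far right as possible instead. Use [^>] instead of . so that the match cannot leave the opening tag: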
>>> print re.search("(<td[^>]*?>str2.*?</td>)",mystring).group(1)
<td attr="5">str2</td>
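As a minimal sketch of the difference, here are the original lazy pattern and the [^>] pattern side by side. It is written for Python 3 (print as a function) rather than the Python 2 session above; the variable names lazy and fixed are only illustrative.

```python
import re

mystring = ('<td attr="0">str1</td><td attr="5">str2</td>'
            '<td attr="7">str2</td><td attr="9">str4</td>')

# Lazy quantifiers shorten the match from a fixed starting point only;
# the search still begins at the leftmost <td, and . crosses into the next cell.
lazy = re.search(r"(<td.*?str2.*?</td>)", mystring).group(1)

# [^>]* cannot cross the end of the opening tag, so the match has to
# start at a <td ...> that is immediately followed by str2.
fixed = re.search(r"(<td[^>]*>str2.*?</td>)", mystring).group(1)

print(lazy)   # <td attr="0">str1</td><td attr="5">str2</td>
print(fixed)  # <td attr="5">str2</td>
```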
Or, better, use HTMLParser.
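A rough sketch of the parser route, assuming Python 3, where the class lives in html.parser (in Python 2 the module was named HTMLParser). The subclass name FirstCellWithText and its attributes are made up for this example. This parser is event-based and fairly tolerant of sloppy markup, so it may still cope with input that is not valid XML, though that depends on the actual data.

```python
from html.parser import HTMLParser

mystring = ('<td attr="0">str1</td><td attr="5">str2</td>'
            '<td attr="7">str2</td><td attr="9">str4</td>')


class FirstCellWithText(HTMLParser):
    """Record the attributes of the first <td> whose text equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.current_attrs = None  # attributes of the <td> we are inside, if any
        self.found = None          # attributes of the first matching <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.current_attrs = dict(attrs)

    def handle_endtag(self, tag):
        if tag == "td":
            self.current_attrs = None

    def handle_data(self, data):
        inside_td = self.current_attrs is not None
        if self.found is None and inside_td and data.strip() == self.wanted:
            self.found = self.current_attrs


parser = FirstCellWithText("str2")
parser.feed(mystring)
parser.close()
print(parser.found)  # {'attr': '5'}
```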
EDIT: This regex also matches when str2 is wrapped in sub-tags inside the td:
(<td[^<]*?(?:<(?!td)[^<]*?)*str2.*?</td>)
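To illustrate, here is a small Python 3 check against a made-up input (the nested string is an assumption, not your data) in which the first str2 sits inside a <b> tag. The (?!td) lookahead lets the match step over other tags but never over the start of another td.

```python
import re

# Hypothetical input: the first str2 is wrapped in a <b> sub-tag.
nested = ('<td attr="0">str1</td>'
          '<td attr="5"><b>str2</b></td>'
          '<td attr="7">str2</td>')

# Pattern from the EDIT: skip over tags that do not start with "td".
subtag = re.search(r"(<td[^<]*?(?:<(?!td)[^<]*?)*str2.*?</td>)", nested).group(1)

# The simpler pattern requires str2 right after the opening tag,
# so it skips the nested cell and finds the later one.
simple = re.search(r"(<td[^>]*?>str2.*?</td>)", nested).group(1)

print(subtag)  # <td attr="5"><b>str2</b></td>
print(simple)  # <td attr="7">str2</td>
```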