Tags: regex, python-2.7, regex-greedy, non-greedy

Python non-greedy regular expression matches too much data


String: '<td attr="0">str1</td><td attr="5">str2</td><td attr="7">str2</td><td attr="9">str4</td>'

I want to match only the first "td" tag that contains the text "str2", so I tried two different non-greedy expressions:

>>> import re
>>> mystring = '<td attr="0">str1</td><td attr="5">str2</td><td attr="7">str2</td><td attr="9">str4</td>'
>>> print re.search("(<td.*?str2.*?</td>)",mystring).group(1)
<td attr="0">str1</td><td attr="5">str2</td>
>>> print re.search(".*(<td.*?str2.*?</td>).*",mystring).group(1)
<td attr="7">str2</td>

I expected the output to be '<td attr="5">str2</td>', because I used non-greedy quantifiers in the regular expression. What is wrong here, and how can I get the expected result?

Note: I cannot use an HTML parser, because my actual data set is not well-formed enough for XML parsing.


Solution

  • Use [^>] instead of . so that the match cannot run past the end of the opening tag:

    >>> print re.search("(<td[^>]*?>str2.*?</td>)",mystring).group(1)
    <td attr="5">str2</td>
    


    Or, better, use HTMLParser.

    EDIT: This regex also matches when the cell contains sub-tags before str2:

    (<td[^<]*?(?:<(?!td)[^<]*?)*str2.*?</td>)
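
As a rough illustration of why the original attempt matched too much, the sketch below runs the patterns from this thread side by side. A lazy quantifier only shortens a match from its right-hand end; the engine still commits to the leftmost position where a match is possible, so ".*?" is free to run past the first "</td>" until it reaches "str2". The nested_input string is a made-up sample (not from the question) used only to exercise the sub-tag pattern, and print is called with a single argument so the snippet should behave the same on Python 2.7 and Python 3.

```python
import re

# Sample input copied from the question.
mystring = ('<td attr="0">str1</td><td attr="5">str2</td>'
            '<td attr="7">str2</td><td attr="9">str4</td>')

# The search starts at the leftmost "<td". The lazy ".*?" grows one character
# at a time and is allowed to cross "</td>", so the match begins at the first
# cell and ends at the first "</td>" after "str2".
lazy = re.search(r'(<td.*?str2.*?</td>)', mystring).group(1)

# "[^>]*?" cannot cross the ">" that closes the opening tag, so "str2" must be
# the first thing inside the cell for the match to succeed.
strict = re.search(r'(<td[^>]*?>str2.*?</td>)', mystring).group(1)

# Pattern from the EDIT: other tags may appear before "str2", but the negative
# lookahead (?!td) stops the match from crossing into another "<td".
nested_pattern = r'(<td[^<]*?(?:<(?!td)[^<]*?)*str2.*?</td>)'
# Hypothetical input, invented here to show a cell with a sub-tag.
nested_input = '<td attr="0">str1</td><td attr="5"><b>str2</b></td>'
nested = re.search(nested_pattern, nested_input).group(1)

print(lazy)    # <td attr="0">str1</td><td attr="5">str2</td>
print(strict)  # <td attr="5">str2</td>
print(nested)  # <td attr="5"><b>str2</b></td>
```

Note that the stricter "[^>]" pattern finds nothing in nested_input, because "str2" does not directly follow the opening tag there; that is the case the EDIT pattern is meant to cover.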