python regex pattern-matching regex-lookarounds regex-greedy

Regex to match only book('knjiga') with specific name('naslov')

I have a simple xml:

<?xml version="1.0" encoding="utf-8" ?>
<book_list>
    <book rbr="1" >
        <title> Yacc </title>
        <author> Filip Maric </author>
        <year> 2004 </year>
        <publisher> Matematicki fakultet </publisher>
        <price currency="din"> 100 </price>
    </book>
    <book rbr="2" >
        <author> Fredrik Lundh </author>
        <price currency="eur"> 50 </price>
        <publisher> O’Reilly & Associates </publisher>
        <year> 2001 </year>
        <title> Python Standard Library </title>
    </book>
</book_list>

I need to match a book with a specific name with regex in Python. I can easily match any book with:

r'<book\s*rbr="\d+"\s*>.*?</book>'

(with single-line mode on) and then check whether it is the right one. But if I want to match a specific book, e.g. Python Standard Library, directly with the regex, I can't get it right. If I try

r'<book\s*rbr="\d+"\s*>(?P<book>.*?<title> Python Standard Library </title>.*?)</book>'

with single-line mode on, it matches everything from the first <book> tag onward. I understand why, but I couldn't find a way to match only one book tag. I tried all the lookarounds and all the different modes without success.

What is the right way to do it so that it works for any number of books in book_list?

Solution

The problem is greatly complicated by the fact that the <title> tag is not consistently the first child tag under <book>. If it were, you could use:

m = re.search(r'<book\s*rbr="\d+"\s*>\s*(?P<book><title> Python Standard Library </title>).*?</book>', xml, flags=re.DOTALL)

That is, the .*? before the <title> tag is replaced with \s*.
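
To see why the title-first pattern is not enough for this data, here is a minimal sketch that runs it against the XML from the question. The helper name title_first_search is made up for illustration, and re.escape is added on the assumption that a title might contain regex metacharacters.

```python
import re

# Sample data from the question: the title is first in book 1 but last in book 2.
xml = """<?xml version="1.0" encoding="utf-8" ?>
<book_list>
    <book rbr="1" >
        <title> Yacc </title>
        <author> Filip Maric </author>
        <year> 2004 </year>
        <publisher> Matematicki fakultet </publisher>
        <price currency="din"> 100 </price>
    </book>
    <book rbr="2" >
        <author> Fredrik Lundh </author>
        <price currency="eur"> 50 </price>
        <publisher> O’Reilly & Associates </publisher>
        <year> 2001 </year>
        <title> Python Standard Library </title>
    </book>
</book_list>"""


def title_first_search(title):
    # Hypothetical helper: only whitespace may sit between <book ...> and <title>.
    pattern = (
        rf'<book\s*rbr="\d+"\s*>\s*'
        rf'(?P<book><title> {re.escape(title)} </title>).*?</book>'
    )
    return re.search(pattern, xml, flags=re.DOTALL)


yacc = title_first_search('Yacc')
psl = title_first_search('Python Standard Library')
print(yacc is not None)  # True: <title> is the first child of book 1
print(psl is not None)   # False: <title> is the last child of book 2
```

So the simple pattern only finds books whose <title> happens to come first, which is why something stronger is needed below.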

The trick is to make sure that, once you have matched a <book> tag, the <title> tag you are looking for does not come after a later </book> tag, i.e. that it belongs to this book and not to some other one. This can be accomplished with a negative lookahead (it's not pretty):

import re

xml = """<?xml version="1.0" encoding="utf-8" ?>
<book_list>
    <book rbr="1" >
        <title> Yacc </title>
        <author> Filip Maric </author>
        <year> 2004 </year>
        <publisher> Matematicki fakultet </publisher>
        <price currency="din"> 100 </price>
    </book>
    <book rbr="2" >
        <author> Fredrik Lundh </author>
        <price currency="eur"> 50 </price>
        <publisher> O’Reilly & Associates </publisher>
        <year> 2001 </year>
        <title> Python Standard Library </title>
    </book>
</book_list>"""

m = re.search(r'<book\s*rbr="\d+"\s*>(?!.*</book>.*<title> Python Standard Library </title>).*(?P<book><title> Python Standard Library </title>).*?</book>', xml, flags=re.DOTALL)
print(m.group('book'))
m = re.search(r'<book\s*rbr="\d+"\s*>(?!.*</book>.*<title> Yacc </title>).*(?P<book><title> Yacc </title>).*?</book>', xml, flags=re.DOTALL)
print(m.group('book'))

Prints:

<title> Python Standard Library </title>
<title> Yacc </title>
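
A different way to get the same guarantee, offered here only as a sketch and not as part of the pattern above, is to stop the scan from ever stepping onto a </book> before the title is found. This is sometimes called a tempered dot. The function name find_book is hypothetical, and it returns the whole <book> element instead of just the title.

```python
import re

xml = """<?xml version="1.0" encoding="utf-8" ?>
<book_list>
    <book rbr="1" >
        <title> Yacc </title>
        <author> Filip Maric </author>
        <year> 2004 </year>
        <publisher> Matematicki fakultet </publisher>
        <price currency="din"> 100 </price>
    </book>
    <book rbr="2" >
        <author> Fredrik Lundh </author>
        <price currency="eur"> 50 </price>
        <publisher> O’Reilly & Associates </publisher>
        <year> 2001 </year>
        <title> Python Standard Library </title>
    </book>
</book_list>"""


def find_book(xml_text, title):
    """Return the <book>...</book> string containing the title, or None."""
    wanted = re.escape(f'<title> {title} </title>')
    pattern = (
        r'<book\s*rbr="\d+"\s*>'   # opening tag
        r'(?:(?!</book>).)*?'       # any character, as long as </book> does not start here
        + wanted
        + r'.*?</book>'             # lazily run to this book's own closing tag
    )
    m = re.search(pattern, xml_text, flags=re.DOTALL)
    return m.group(0) if m else None


print(find_book(xml, 'Python Standard Library'))
```

Because the check never looks past the current book, the result does not depend on what later books contain.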

You can reduce the redundancy by using formatted string literals if your Python supports them (3.6 and later), or the str.format method if it doesn't:

title = '<title> Python Standard Library </title>'
m = re.search(rf'<book\s*rbr="\d+"\s*>(?!.*</book>.*{title}).*(?P<book>{title}).*?</book>', xml, flags=re.DOTALL)
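
One thing to watch when interpolating a title into the pattern is that the title is then treated as regex, so characters such as + or ( change its meaning. A minimal sketch of guarding against that with re.escape follows. The helper name build_pattern and the title "C++ (draft)" are invented for illustration and do not come from the question.

```python
import re


def build_pattern(title_text):
    # Hypothetical helper: escape the literal title before interpolating it,
    # so metacharacters in the title are matched as plain text.
    title = re.escape(f'<title> {title_text} </title>')
    return re.compile(
        rf'<book\s*rbr="\d+"\s*>(?!.*</book>.*{title}).*(?P<book>{title}).*?</book>',
        flags=re.DOTALL,
    )


# Trimmed sample; the first title is made up to exercise the escaping.
xml = """<book_list>
    <book rbr="1" >
        <title> C++ (draft) </title>
    </book>
    <book rbr="2" >
        <author> Fredrik Lundh </author>
        <title> Python Standard Library </title>
    </book>
</book_list>"""

m1 = build_pattern('C++ (draft)').search(xml)
m2 = build_pattern('Python Standard Library').search(xml)
print(m1.group('book'))
print(m2.group('book'))
```

Note also that a literal { or } in the regex part of an f-string (for example a quantifier like {2,3}) has to be doubled.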

An Alternate Approach

This approach builds a list of all the individual <book> elements and then checks each one in order for the title of interest:

# create list of <book> ... </book> strings:
books = re.findall(r'<book\s*rbr="\d+"\s*>.*?</book>', xml, flags=re.DOTALL)
title = '<title> Python Standard Library </title>'
# now search each <book>...</book> string looking for the title string:
for book in books:
    if title in book:  # plain substring test; the title is literal text, not a pattern
        print(title)
        print(book)

Prints:

<title> Python Standard Library </title>
<book rbr="2" >
        <author> Fredrik Lundh </author>
        <price currency="eur"> 50 </price>
        <publisher> O’Reilly & Associates </publisher>
        <year> 2001 </year>
        <title> Python Standard Library </title>
    </book>
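
If you want this two-step approach as something reusable, a rough sketch could look like the following. The names BOOK_RE, TITLE_RE and books_with_title are hypothetical. It also assumes you would rather compare the title text with the surrounding whitespace stripped than match the exact '<title> ... </title>' string.

```python
import re

xml = """<?xml version="1.0" encoding="utf-8" ?>
<book_list>
    <book rbr="1" >
        <title> Yacc </title>
        <author> Filip Maric </author>
        <year> 2004 </year>
        <publisher> Matematicki fakultet </publisher>
        <price currency="din"> 100 </price>
    </book>
    <book rbr="2" >
        <author> Fredrik Lundh </author>
        <price currency="eur"> 50 </price>
        <publisher> O’Reilly & Associates </publisher>
        <year> 2001 </year>
        <title> Python Standard Library </title>
    </book>
</book_list>"""

# Step 1 pattern: one <book>...</book> string per element.
BOOK_RE = re.compile(r'<book\s*rbr="\d+"\s*>.*?</book>', flags=re.DOTALL)
# Step 2 pattern: the title text without the padding whitespace.
TITLE_RE = re.compile(r'<title>\s*(.*?)\s*</title>', flags=re.DOTALL)


def books_with_title(xml_text, wanted):
    """Return every <book> string whose title text equals `wanted`."""
    matches = []
    for book in BOOK_RE.findall(xml_text):
        m = TITLE_RE.search(book)
        if m and m.group(1) == wanted:
            matches.append(book)
    return matches


for book in books_with_title(xml, 'Python Standard Library'):
    print(book)
```

Returning a list means it keeps working if several books share a title, and an empty list signals that nothing matched.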