
Regex to match only the book ('knjiga') with a specific title ('naslov')


I have a simple xml:

<?xml version="1.0" encoding="utf-8" ?>
<book_list>
    <book rbr="1" >
        <title> Yacc </title>
        <author> Filip Maric </author>
        <year> 2004 </year>
        <publisher> Matematicki fakultet </publisher>
        <price currency="din"> 100 </price>
    </book>
    <book rbr="2" >
        <author> Fredrik Lundh </author>
        <price currency="eur"> 50 </price>
        <publisher> O’Reilly & Associates </publisher>
        <year> 2001 </year>
        <title> Python Standard Library </title>
    </book>
</book_list>

I need to match a book with a specific name with regex in Python. I can easily match any book with:

r'<book\s*rbr="\d+"\s*>.*?</book>'

(with single-line mode on, i.e. re.DOTALL) and then check whether it is the right one. But if I want to match a specific book, e.g. Python Standard Library, directly with the regex, I can't get it right. If I try

r'<book\s*rbr="\d+"\s*>(?P<book>.*?<title> Python Standard Library </title>.*?)</book>'

with single-line mode on, it matches everything from the first <book> tag onward. I understand why, but I couldn't find a way to match only the one <book> element. I tried every kind of lookaround and every mode without success.

What is the right way to do it so that it works for any number of books in book_list?
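
The over-match described in the question can be reproduced with a short script. This is only an illustrative sketch: the XML is trimmed to the two titles, and the variable names are assumptions rather than anything from the original post.

```python
import re

# Trimmed-down version of the question's XML (assumed for illustration).
xml = """<book_list>
    <book rbr="1" >
        <title> Yacc </title>
    </book>
    <book rbr="2" >
        <author> Fredrik Lundh </author>
        <title> Python Standard Library </title>
    </book>
</book_list>"""

# The pattern attempted in the question.
pattern = r'<book\s*rbr="\d+"\s*>(?P<book>.*?<title> Python Standard Library </title>.*?)</book>'
m = re.search(pattern, xml, flags=re.DOTALL)

# The match begins at the first <book> tag: the lazy .*? is still allowed
# to grow across </book> until the wanted title is finally found.
print(m.group(0).startswith('<book rbr="1"'))       # True
print('<title> Yacc </title>' in m.group('book'))   # True
```

Laziness only makes .*? prefer the shortest continuation from a given starting point; it does not stop the match from starting at the earliest possible <book> tag.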


Solution

  • The problem is greatly complicated by the fact that the <title> tag is not consistently the first child tag under <book>. If it were, you could use:

    m = re.search(r'<book\s*rbr="\d+"\s*>\s*(?P<book><title> Python Standard Library </title>).*?</book>', xml, flags=re.DOTALL)
    

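    That is, replace the .*? in front of <title> with \s*, so that only whitespace may appear between the opening <book> tag and the title.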

    In the general case, the trick is to make sure that, once you have matched an opening <book> tag, the <title> tag you are looking for does not come after a later </book> tag, i.e. that the title belongs to this book and not to a subsequent one. This can be accomplished with a negative lookahead (it's not pretty):

    import re
    
    xml = """<?xml version="1.0" encoding="utf-8" ?>
    <book_list>
        <book rbr="1" >
            <title> Yacc </title>
            <author> Filip Maric </author>
            <year> 2004 </year>
            <publisher> Matematicki fakultet </publisher>
            <price currency="din"> 100 </price>
        </book>
        <book rbr="2" >
            <author> Fredrik Lundh </author>
            <price currency="eur"> 50 </price>
            <publisher> O’Reilly & Associates </publisher>
            <year> 2001 </year>
            <title> Python Standard Library </title>
        </book>
    </book_list>"""
    
    m = re.search(r'<book\s*rbr="\d+"\s*>(?!.*</book>.*<title> Python Standard Library </title>).*(?P<book><title> Python Standard Library </title>).*?</book>', xml, flags=re.DOTALL)
    print(m.group('book'))
    m = re.search(r'<book\s*rbr="\d+"\s*>(?!.*</book>.*<title> Yacc </title>).*(?P<book><title> Yacc </title>).*?</book>', xml, flags=re.DOTALL)
    print(m.group('book'))
    

    Prints:

    <title> Python Standard Library </title>
    <title> Yacc </title>
    


    You can reduce the redundancy by using a formatted string literal (f-string, available from Python 3.6) or the str.format method on older versions:

    title = '<title> Python Standard Library </title>'
    m = re.search(rf'<book\s*rbr="\d+"\s*>(?!.*</book>.*{title}).*(?P<book>{title}).*?</book>', xml, flags=re.DOTALL)
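
The same lookahead pattern can be wrapped in a small helper. The following is a minimal sketch, not part of the original answer: the function name find_book_by_title and the sample data are hypothetical, re.escape is added so that titles containing regex metacharacters are matched literally, and \s* around the title text is an assumption that tolerates varying whitespace.

```python
import re

def find_book_by_title(xml, title):
    """Return the <book>...</book> text containing `title`, or None."""
    # re.escape keeps characters such as '+' or '(' in the title literal.
    tag = rf'<title>\s*{re.escape(title)}\s*</title>'
    pattern = (
        rf'<book\s*rbr="\d+"\s*>'
        rf'(?!.*</book>.*{tag})'   # reject if the title sits beyond a later </book>
        rf'.*{tag}.*?</book>'
    )
    m = re.search(pattern, xml, flags=re.DOTALL)
    return m.group(0) if m else None

# Hypothetical sample data, trimmed from the question's XML.
sample = """<book_list>
    <book rbr="1" >
        <title> Yacc </title>
        <author> Filip Maric </author>
    </book>
    <book rbr="2" >
        <author> Fredrik Lundh </author>
        <title> Python Standard Library </title>
    </book>
</book_list>"""

print(find_book_by_title(sample, "Python Standard Library"))
```

One caveat of the lookahead approach: because the lookahead scans to the end of the input, a title shared by several books causes every book except the last one carrying that title to be rejected.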
    

    An Alternate Approach

    This approach builds a list of all the individual <book> tags and then searches each one in order looking for the title of interest:

    # create list of <book> ... </book> strings:
    books = re.findall(r'<book\s*rbr="\d+"\s*>.*?</book>', xml, flags=re.DOTALL)
    title = '<title> Python Standard Library </title>'
    # now search each <book>...</book> string looking for the title string:
    for book in books:
        if re.search(re.escape(title), book):  # escape so the title is matched literally
            print(title)
            print(book)
    

    Prints:

    <title> Python Standard Library </title>
    <book rbr="2" >
            <author> Fredrik Lundh </author>
            <price currency="eur"> 50 </price>
            <publisher> O’Reilly & Associates </publisher>
            <year> 2001 </year>
            <title> Python Standard Library </title>
        </book>
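
A single-pattern variant that also copes with several books sharing a title is sketched below. It swaps in a different technique from the one in the answer: a "tempered dot", (?:(?!</book>).)*?, which consumes one character at a time only while no </book> starts at the current position, so the title has to be found inside the current book. The function name find_books_by_title and the sample data are hypothetical.

```python
import re

def find_books_by_title(xml, title):
    """Return every <book>...</book> string whose <title> matches `title`."""
    tag = rf'<title>\s*{re.escape(title)}\s*</title>'
    pattern = (
        rf'<book\s*rbr="\d+"\s*>'
        rf'(?:(?!</book>).)*?'     # any characters, but never across a </book>
        rf'{tag}.*?</book>'
    )
    return [m.group(0) for m in re.finditer(pattern, xml, flags=re.DOTALL)]

# Hypothetical sample data: two of the three books share a title.
sample = """<book_list>
    <book rbr="1" ><title> Yacc </title></book>
    <book rbr="2" ><year> 2001 </year><title> Python Standard Library </title></book>
    <book rbr="3" ><year> 2010 </year><title> Yacc </title></book>
</book_list>"""

for book in find_books_by_title(sample, "Yacc"):
    print(book)
```

Like the other regex approaches here, this assumes that <book> elements are not nested and that the markup follows the exact shape shown in the question.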