I have a simple XML document:
<?xml version="1.0" encoding="utf-8" ?>
<book_list>
<book rbr="1" >
<title> Yacc </title>
<author> Filip Maric </author>
<year> 2004 </year>
<publisher> Matematicki fakultet </publisher>
<price currency="din"> 100 </price>
</book>
<book rbr="2" >
<author> Fredrik Lundh </author>
<price currency="eur"> 50 </price>
<publisher> O’Reilly & Associates </publisher>
<year> 2001 </year>
<title> Python Standard Library </title>
</book>
</book_list>
I need to match a book with a specific name with regex in Python. I can easily match any book with:
r'<book\s*rbr="\d+"\s*>.*?</book>'
(with single-line mode, i.e. re.DOTALL, on) and then check whether it is the right one. But if I want to match a specific book, e.g. Python Standard Library, directly with the regex, I can't get it right. If I try
r'<book\s*rbr="\d+"\s*>(?P<book>.*?<title> Python Standard Library </title>.*?)</book>'
with single-line mode on, it matches everything from the first <book> tag onward. I understand why, but I couldn't find a way to match only one book element. I tried lookarounds and all the different modes without success.
What is the right way to do it so that it works for any number of books in book_list?
The problem is greatly complicated by the fact that the <title> tag is not consistently the first child tag under <book>. If it were, you could use:
m = re.search(r'<book\s*rbr="\d+"\s*>\s*(?P<book><title> Python Standard Library </title>).*?</book>', xml, flags=re.DOTALL)
That is, replacing .*? with \s*.
The trick is to make sure that, once you have matched a <book> tag, the <title> tag you are looking for does not come after a later </book> tag. This can be accomplished with a negative lookahead (it's not pretty):
import re
xml = """<?xml version="1.0" encoding="utf-8" ?>
<book_list>
<book rbr="1" >
<title> Yacc </title>
<author> Filip Maric </author>
<year> 2004 </year>
<publisher> Matematicki fakultet </publisher>
<price currency="din"> 100 </price>
</book>
<book rbr="2" >
<author> Fredrik Lundh </author>
<price currency="eur"> 50 </price>
<publisher> O’Reilly & Associates </publisher>
<year> 2001 </year>
<title> Python Standard Library </title>
</book>
</book_list>"""
m = re.search(r'<book\s*rbr="\d+"\s*>(?!.*</book>.*<title> Python Standard Library </title>).*(?P<book><title> Python Standard Library </title>).*?</book>', xml, flags=re.DOTALL)
print(m.group('book'))
m = re.search(r'<book\s*rbr="\d+"\s*>(?!.*</book>.*<title> Yacc </title>).*(?P<book><title> Yacc </title>).*?</book>', xml, flags=re.DOTALL)
print(m.group('book'))
Prints:
<title> Python Standard Library </title>
<title> Yacc </title>
You can reduce the redundancy by using formatted string literals if your Python supports them (3.6 or later), or the str.format method if it doesn't:
title = '<title> Python Standard Library </title>'
m = re.search(rf'<book\s*rbr="\d+"\s*>(?!.*</book>.*{title}).*(?P<book>{title}).*?</book>', xml, flags=re.DOTALL)
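If the title can contain regex metacharacters (say, "C++"), the interpolated string should be escaped first. Below is a minimal sketch of the same negative-lookahead pattern wrapped in a function. The name find_book and the three-book sample_xml are made up for illustration, and the function returns the whole matched <book> element via group(0) rather than only the title:

```python
import re

def find_book(xml, title):
    """Return the whole <book>...</book> string containing the title, or None."""
    # re.escape keeps characters such as '+' or '(' in a title literal.
    title_tag = re.escape(f'<title> {title} </title>')
    pattern = (
        r'<book\s*rbr="\d+"\s*>'
        rf'(?!.*</book>.*{title_tag})'  # reject this <book> if the title appears after some later </book>
        rf'.*{title_tag}.*?</book>'
    )
    m = re.search(pattern, xml, flags=re.DOTALL)
    return m.group(0) if m else None

# Hypothetical three-book sample, made up for illustration.
sample_xml = """<book_list>
<book rbr="1" >
<title> Yacc </title>
</book>
<book rbr="2" >
<year> 2001 </year>
<title> Python Standard Library </title>
</book>
<book rbr="3" >
<title> C++ Primer </title>
<year> 2012 </year>
</book>
</book_list>"""

print(find_book(sample_xml, 'Python Standard Library'))
```

This still assumes each title occurs in at most one book and that the spacing inside <title> is exactly one space on each side, as in the question's data.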
An Alternate Approach
This approach builds a list of all the individual <book> tags and then searches each one in order looking for the title of interest:
# create list of <book> ... </book> strings:
books = re.findall(r'<book\s*rbr="\d+"\s*>.*?</book>', xml, flags=re.DOTALL)
title = '<title> Python Standard Library </title>'
# now search each <book>...</book> string looking for the title string:
for book in books:
    if re.search(rf'{title}', book):
        print(title)
        print(book)
Prints:
<title> Python Standard Library </title>
<book rbr="2" >
<author> Fredrik Lundh </author>
<price currency="eur"> 50 </price>
<publisher> O’Reilly & Associates </publisher>
<year> 2001 </year>
<title> Python Standard Library </title>
</book>
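The same two-step idea can be packaged as a generator, which may be handier when there are many books. This is only a sketch: books_with_title, BOOK_RE, TITLE_RE and sample_xml are names invented here, and it compares the stripped title text instead of the full tag so that the padding inside <title> does not matter:

```python
import re

BOOK_RE = re.compile(r'<book\s*rbr="\d+"\s*>.*?</book>', flags=re.DOTALL)
TITLE_RE = re.compile(r'<title>(.*?)</title>', flags=re.DOTALL)

def books_with_title(xml, wanted):
    """Yield every <book>...</book> string whose title text equals `wanted`."""
    for m in BOOK_RE.finditer(xml):
        book = m.group(0)
        t = TITLE_RE.search(book)
        # Compare the stripped title text, so padding inside <title> is ignored.
        if t and t.group(1).strip() == wanted:
            yield book

# Hypothetical sample, made up for illustration.
sample_xml = """<book_list>
<book rbr="1" >
<title> Yacc </title>
<year> 2004 </year>
</book>
<book rbr="2" >
<year> 2001 </year>
<title>Python Standard Library</title>
</book>
</book_list>"""

matches = list(books_with_title(sample_xml, 'Python Standard Library'))
print(matches[0])
```

Because each book is cut out first with the non-greedy .*?</book>, the title search can never spill over into a neighbouring book, so no lookahead is needed.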