python xml xpath lxml

python lxml xpath query fails on hardcoded url but works on a bytes string


I am trying to extract the parsable-cite attribute of the external-xref element inside a text element. I am parsing the XML at the URL "https://www.congress.gov/118/bills/hr61/BILLS-118hr61ih.xml".

The code I'm using is below (also available on Replit: https://replit.com/join/ohhztxpqdr-aam88):

from lxml import etree
import requests

url = "https://www.congress.gov/118/bills/hr61/BILLS-118hr61ih.xml"
response = requests.get(url)
xml_response = response.content

tree = etree.fromstring(xml_response)
result = tree.xpath("//text[contains(., 'is amended')]")

for r in result:
  external_xref = r.find("external-xref")
  print(external_xref.attrib)

I get an error saying that I'm accessing an attribute on None, as if the search didn't find anything:

AttributeError: 'NoneType' object has no attribute 'attrib'

When I use the same code and instead use the snippet of the text node directly, I get the following:

text = b'<text display-inline="no-display-inline">Section 4702 of the Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act (<external-xref legal-doc="usc" parsable-cite="usc/18/249">18 U.S.C. 249</external-xref> note) is amended by adding at the end the following: </text>'

tree = etree.fromstring(text)
result = tree.xpath("//text[contains(., 'is amended')]")

for r in result:
  external_xref = r.find("external-xref")
  print(external_xref.attrib)
{'legal-doc': 'usc', 'parsable-cite': 'usc/18/249'}

The issue seems to come from processing the content fetched from the URL directly. Any recommendations on how to proceed?

Thanks


Solution

  • In https://www.congress.gov/118/bills/hr61/BILLS-118hr61ih.xml, there are two text elements that contain the string "is amended", but only the second one has an external-xref child element. For the first one, find() returns None, and accessing .attrib on None raises the AttributeError.

    The following update to the loop produces the desired output:

    for r in result:
        external_xref = r.find("external-xref")
        if external_xref is not None:    # Check if there actually is an external-xref
            print(external_xref.attrib)
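
As a rough illustration of the point above, here is a minimal offline sketch. The SAMPLE document is a made-up stand-in for the bill XML (only the second text element is copied from the question), and the names matches, with_xref and cites are just for this example. It first counts how many matched text elements actually have an external-xref child, then shows one possible alternative: letting the XPath expression select the attribute directly, so text elements without an external-xref never enter the result.

```python
from lxml import etree

# Hypothetical stand-in for the bill XML: two <text> elements contain
# "is amended", but only the second one has an <external-xref> child.
SAMPLE = (
    b"<bill>"
    b"<section><text>This Act is amended as described below.</text></section>"
    b'<section><text display-inline="no-display-inline">Section 4702 of the '
    b"Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act ("
    b'<external-xref legal-doc="usc" parsable-cite="usc/18/249">18 U.S.C. 249'
    b"</external-xref> note) is amended by adding at the end the following: "
    b"</text></section>"
    b"</bill>"
)

tree = etree.fromstring(SAMPLE)

# Diagnostic: find() returns None for matches without the child element,
# so count how many matches really have one.
matches = tree.xpath("//text[contains(., 'is amended')]")
with_xref = [t for t in matches if t.find("external-xref") is not None]

# Alternative: select the attribute in the XPath itself. Elements without
# an external-xref child simply contribute nothing to the result list.
cites = [
    str(c)
    for c in tree.xpath(
        "//text[contains(., 'is amended')]/external-xref/@parsable-cite"
    )
]

print(len(matches), len(with_xref), cites)
```

With this sample, two elements match but only one has the child, which is the same situation the answer describes for the real document. Whether the XPath-only form or the explicit None check reads better is a matter of taste; the None check keeps access to the whole element, while the XPath form returns just the attribute values.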