
Parse id from XBRL file with BeautifulSoup


I have an issue scraping an XBRL file with BeautifulSoup.

Code:

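from bs4 import BeautifulSoup

# The step that builds xbrlsoup was missing from the snippet; the 'lxml' parser is an assumption.
with open(file.file_path, 'r') as openxbrl:
    readxbrl = openxbrl.read()
xbrlsoup = BeautifulSoup(readxbrl, 'lxml')
contextsoup = xbrlsoup.find_all('xbrli:context')
print(contextsoup)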

This gives the following output (a sample; there are multiple children):

  <xbrli:context id="context0">
     <xbrli:period>
      <xbrli:instant>
       2020-12-31
      </xbrli:instant>
     </xbrli:period>

I can't figure out how to get the context id (id="context0") without printing the whole contextsoup. I tried to get at the id through the tag name:

import pprint

for child in contextsoup:
    pprint.pprint(child.name)
    pprint.pprint(child.find('xbrli:period'))

But this does not give me the id:

  'xbrli:context'
    <xbrli:period>
     <xbrli:instant>
      2020-12-31
     </xbrli:instant>
    </xbrli:period>

How can I parse the id without printing the whole xbrl?


Solution

  • id="context0" is not part of the element name; it is an attribute (see the Beautiful Soup documentation on attributes).

    You can access attribute values by treating a tag as a dict:

    for context in contextsoup:
        print(context['id'])
    

    You can also find tags directly by attribute values. The value of the id attribute should be unique across the document, so you can just do:

    soup.find(id='context0')
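
As a minimal, self-contained sketch of both approaches, the snippet below runs them against a hand-written sample modelled on the output in the question. The sample document, the second context, and the choice of `html.parser` are assumptions made for illustration; the asker's real file and parser may differ.

```python
from bs4 import BeautifulSoup

# Hand-written sample modelled on the output shown in the question (not real data).
SAMPLE = """
<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance">
  <xbrli:context id="context0">
    <xbrli:period><xbrli:instant>2020-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <xbrli:context id="context1">
    <xbrli:period><xbrli:instant>2019-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
</xbrli:xbrl>
"""

# Assumption: an HTML-style parser, which keeps "xbrli:context" as a literal tag name.
soup = BeautifulSoup(SAMPLE, "html.parser")

# Dict-style access reads the attribute value from each matching tag.
context_ids = [context["id"] for context in soup.find_all("xbrli:context")]
print(context_ids)

# Looking a tag up directly by its id attribute.
first = soup.find(id="context0")
instant = first.find("xbrli:instant").get_text(strip=True)
print(first.name, first["id"], instant)

# .get() returns None instead of raising KeyError when an attribute is absent.
missing = first.get("no-such-attribute")
print(missing)
```

Using `context.get('id')` rather than `context['id']` may be the safer choice if some matched tags could lack the attribute.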
    

    You should also be aware that you are working with namespaced XML; if you are working with different XBRL reports, you can't rely on the context tags always being called xbrli:context because the xbrli bit is a document-defined prefix that provides a shorthand for a namespace URI. I believe that Beautiful Soup 4 does have some namespace support, but I've not used it.
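
To illustrate the prefix point, here is a rough sketch that matches contexts by namespace URI instead of by prefix. It swaps in the standard library's `xml.etree.ElementTree` rather than Beautiful Soup, and the sample document deliberately uses a made-up prefix (`foo`) bound to the XBRL instance namespace; both choices are assumptions made for the sake of the example.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the XBRL 2.1 instance schema; the prefix bound to it can vary per document.
XBRLI_NS = "http://www.xbrl.org/2003/instance"

# Hand-written sample using a hypothetical prefix "foo" instead of "xbrli".
SAMPLE = """
<foo:xbrl xmlns:foo="http://www.xbrl.org/2003/instance">
  <foo:context id="context0">
    <foo:period><foo:instant>2020-12-31</foo:instant></foo:period>
  </foo:context>
</foo:xbrl>
""".strip()

root = ET.fromstring(SAMPLE)

# ElementTree names elements as "{namespace-uri}localname", so the prefix is irrelevant.
contexts = list(root.iter(f"{{{XBRLI_NS}}}context"))
ns_context_ids = [context.get("id") for context in contexts]
print(ns_context_ids)

# The same idea with a prefix map that is local to this script, not to the document.
ns = {"x": XBRLI_NS}
instant_text = contexts[0].findtext("x:period/x:instant", namespaces=ns)
print(instant_text)
```

A search for the literal name `xbrli:context` would find nothing useful in a document like this one, whereas the URI-based lookup still works.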

    There is quite a lot of complexity in correctly parsing XBRL in its XML format, and I would recommend using an existing XBRL processor to do it. One of the best ways of dealing with XBRL reports is to convert them to the newer xBRL-JSON format and then work with them as JSON data. There are a number of tools that can do this conversion, including the open source Arelle project.
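
To give a feel for working with the converted data, the sketch below reads facts from a small xBRL-JSON document using only the `json` module. The sample is invented for illustration (the `ex:Assets` concept, the entity identifier, and the value are all hypothetical), and the layout reflects my understanding of xBRL-JSON, where each fact carries its own `dimensions` (concept, entity, period, unit) rather than pointing at a separate context element.

```python
import json

# Invented sample in the general shape of an xBRL-JSON report (not real data).
SAMPLE_JSON = """
{
  "documentInfo": {"documentType": "https://xbrl.org/2021/xbrl-json"},
  "facts": {
    "f1": {
      "value": "1000",
      "dimensions": {
        "concept": "ex:Assets",
        "entity": "scheme:01",
        "period": "2020-12-31T00:00:00",
        "unit": "iso4217:EUR"
      }
    }
  }
}
"""

report = json.loads(SAMPLE_JSON)

# Collect (fact id, concept, period, value) without any XML or namespace handling.
rows = [
    (
        fact_id,
        fact["dimensions"]["concept"],
        fact["dimensions"].get("period"),  # assumed optional, so read defensively
        fact["value"],
    )
    for fact_id, fact in report["facts"].items()
]

for row in rows:
    print(row)
```

In this form the period that the question was trying to reach through a context id sits directly on each fact.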