I have an issue with scraping an XBRL file using Beautiful Soup.
Code:
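from bs4 import BeautifulSoup

with open(file.file_path, 'r') as openxbrl:
    readxbrl = openxbrl.read()
# The line creating xbrlsoup was missing from the post; the 'lxml' parser is an assumption.
xbrlsoup = BeautifulSoup(readxbrl, 'lxml')
contextsoup = xbrlsoup.find_all('xbrli:context')
print(contextsoup)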
This gives the following output (a sample; there are multiple children):
<xbrli:context id="context0">
<xbrli:period>
<xbrli:instant>
2020-12-31
</xbrli:instant>
</xbrli:period>
I can't seem to figure out how I can parse the context id (id="context0") without printing the whole contextsoup. I tried to print the id by parsing the name:
for child in contextsoup:
    pprint.pprint(child.name)
    pprint.pprint(child.find('xbrli:period'))
But this does not give me the id:
'xbrli:context'
<xbrli:period>
<xbrli:instant>
2020-12-31
</xbrli:instant>
</xbrli:period>
How can I parse the id without printing the whole xbrl?
id="context0" is not part of the element name; it is an attribute (see the Beautiful Soup docs on attributes).
You can access attribute values by treating a tag as a dict:
for context in contextsoup:
    print(context['id'])
You can also find tags directly by attribute values. The value of the id
attribute should be unique across the document, so you can just do:
xbrlsoup.find(id='context0')
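To make both approaches concrete, here is a minimal, self-contained sketch. The sample document, the second context (context1) and the variable names are made up for illustration, and it uses the built-in html.parser only so that it runs without lxml installed; for real XBRL files an XML-aware parser is probably the better choice.

```python
from bs4 import BeautifulSoup

# Hypothetical two-context sample, modelled on the output in the question.
SAMPLE = """
<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance">
  <xbrli:context id="context0">
    <xbrli:period><xbrli:instant>2020-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <xbrli:context id="context1">
    <xbrli:period><xbrli:instant>2019-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
</xbrli:xbrl>
"""

# html.parser is an assumption made to keep the sketch dependency-free.
soup = BeautifulSoup(SAMPLE, "html.parser")

# Attribute access: tag['id'] raises KeyError if the attribute is absent,
# while tag.get('id') returns None instead.
contexts = soup.find_all("xbrli:context")
context_ids = [c.get("id") for c in contexts]
print(context_ids)  # ['context0', 'context1']

# Direct lookup by attribute value, then drill down into that one context.
match = soup.find(id="context1")
instant = match.find("xbrli:instant").get_text(strip=True)
print(instant)  # 2019-12-31
```

Using .get('id') rather than ['id'] is just a defensive choice for tags that might lack the attribute; a context element should always have one.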
You should also be aware that you are working with namespaced XML. If you are working with different XBRL reports, you can't rely on the context tags always being called xbrli:context, because the xbrli part is a document-defined prefix that provides a shorthand for a namespace URI. I believe that Beautiful Soup 4 does have some namespace support, but I've not used it.
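As an illustration of the prefix problem, here is a sketch that matches on the namespace URI instead of the prefix. It swaps in the standard library's xml.etree.ElementTree rather than Beautiful Soup, and the sample document (which deliberately uses the prefix xb instead of xbrli) is invented for the example.

```python
import xml.etree.ElementTree as ET

# The XBRL instance namespace URI; this is what actually identifies the element.
XBRLI_NS = "http://www.xbrl.org/2003/instance"

# Hypothetical report that binds the same namespace to a different prefix.
SAMPLE = """<xb:xbrl xmlns:xb="http://www.xbrl.org/2003/instance">
  <xb:context id="context0">
    <xb:period><xb:instant>2020-12-31</xb:instant></xb:period>
  </xb:context>
</xb:xbrl>"""

root = ET.fromstring(SAMPLE)

# The prefix in this mapping is local to the query; it does not have to
# match whatever prefix the document itself uses.
ns = {"xbrli": XBRLI_NS}
ns_context_ids = [ctx.get("id") for ctx in root.findall(".//xbrli:context", ns)]
print(ns_context_ids)  # ['context0']
```

A search for the literal tag name xbrli:context would find nothing in this document, whereas the namespace-based query still works.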
There is quite a lot of complexity in correctly parsing XBRL in its XML format, and I would recommend using an existing XBRL processor to do it. One of the best ways of dealing with XBRL reports is to convert them to the newer xBRL-JSON format and then work with them as JSON data. There are a number of tools that can do this conversion, including the open source Arelle project.
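Once a report is in xBRL-JSON, reading it needs nothing more than the json module. The sketch below assumes the general layout I understand xBRL-JSON to have (a top-level facts object whose entries carry a value and a dimensions object); the sample data, the ex: concept and the entity identifier are all invented, so check the output of your converter before relying on these keys.

```python
import json

# Invented sample in the assumed xBRL-JSON layout; a real file would come
# from a converter such as Arelle.
SAMPLE_JSON = """
{
  "documentInfo": {
    "documentType": "https://xbrl.org/2021/xbrl-json",
    "namespaces": {"ex": "http://example.com/taxonomy"}
  },
  "facts": {
    "f1": {
      "value": "1200",
      "dimensions": {
        "concept": "ex:Assets",
        "entity": "scheme:0001",
        "period": "2020-12-31T00:00:00",
        "unit": "iso4217:EUR"
      }
    }
  }
}
"""

report = json.loads(SAMPLE_JSON)

# Each fact carries its period and entity directly, so there are no
# separate context elements (or context ids) to resolve.
rows = [
    (fact_id, fact["dimensions"]["concept"], fact["dimensions"].get("period"), fact["value"])
    for fact_id, fact in report["facts"].items()
]
print(rows)  # [('f1', 'ex:Assets', '2020-12-31T00:00:00', '1200')]
```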