Tags: python, xml, xml-namespaces, xbrl

Parse XBRL file in Python


I am working on an XML parser. The goal is to parse a number of different XML files in which the prefixes and tags remain consistent but the namespaces change.

I am therefore trying one of the following:

  • parse the XML by <prefix:tag> alone, without resolving (replacing) the prefix with the namespace. The prefixes remain unchanged from document to document.
  • load the namespaces automatically so that each identifier (<prefix:tag>) can be resolved to the proper namespace.
  • parse the XML by tag name alone.
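
For reference, the second option can be sketched with the standard library alone: `xml.etree.ElementTree.iterparse` reports every namespace declaration as a `start-ns` event, so the prefix-to-URI map can be built while the file is read. This is only a minimal sketch; the sample document, its URIs, and the helper name `parse_with_namespaces` are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an XBRL instance; real files are far larger.
SAMPLE = b"""<?xml version="1.0"?>
<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
            xmlns:us-gaap="http://fasb.org/us-gaap/2013-01-31">
  <us-gaap:Assets contextRef="c1">100</us-gaap:Assets>
  <us-gaap:Liabilities contextRef="c1">40</us-gaap:Liabilities>
</xbrli:xbrl>"""


def parse_with_namespaces(source):
    """Parse a path or binary file object; return (root, {prefix: uri})."""
    namespaces = {}
    parser = ET.iterparse(source, events=("start-ns",))
    for _event, (prefix, uri) in parser:
        # Keep the first URI seen for each prefix.
        namespaces.setdefault(prefix, uri)
    # The root attribute is available once the source is fully read.
    return parser.root, namespaces


root, ns = parse_with_namespaces(io.BytesIO(SAMPLE))

# The same prefixed query works even when the URIs differ between files.
assets = root.findall("us-gaap:Assets", ns)
print([el.text for el in assets])
```

Because the map is rebuilt for every file, the queries can keep using the stable prefixes while the URIs behind them change.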

I have tried xml.etree.ElementTree.

I also had a look at lxml. I did not find any configuration option of its XMLParser that could help me, although I read an answer here whose author suggests that lxml should be able to collect namespaces for me automatically.

Interestingly, parsed_file = etree.XML(file) fails with the error:

lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
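
One likely cause, assuming `file` holds a file name rather than the document text: `XML()` parses a string of XML, so a path is rejected at its first character, whereas `parse()` accepts a path or file object. The sketch below shows the distinction with the standard library's ElementTree, which offers the same two calls as lxml.etree; the temporary file and its content are made up.

```python
import os
import tempfile
import xml.etree.ElementTree as etree  # lxml.etree offers XML() and parse() in the same roles

# Hypothetical one-element document written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write("<root><item>1</item></root>")
    path = handle.name

failed = False
try:
    try:
        # XML() expects the document text, so the path string itself is parsed.
        etree.XML(path)
    except etree.ParseError:
        # lxml raises XMLSyntaxError at this point instead.
        failed = True

    # parse() takes a path (or file object) and reads the document from it.
    root_from_file = etree.parse(path).getroot()

    # Alternatively, read the content first and hand the text to XML().
    with open(path, "rb") as fh:
        root_from_text = etree.XML(fh.read())
finally:
    os.remove(path)
```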

One example of the files I would like to parse is here


Solution

  • items = tree.xpath("*[local-name(.) = 'a_tag_goes_here']")

    did the job: it selects elements by their local name, regardless of namespace. On top of that, I had to go through the resulting list of items manually to apply the other filters I needed.
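
If lxml is not available, a similar match can be sketched with the standard library, assuming Python 3.8 or later, where ElementTree accepts the `{*}` wildcard for "any namespace". The sample document and the `contextRef` filter are made up for illustration; the filter stands in for the manual filtering mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical XBRL-like fragment with made-up facts.
SAMPLE = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
            xmlns:us-gaap="http://fasb.org/us-gaap/2013-01-31">
  <us-gaap:Assets contextRef="c1">100</us-gaap:Assets>
  <us-gaap:Assets contextRef="c2">120</us-gaap:Assets>
  <us-gaap:Liabilities contextRef="c1">40</us-gaap:Liabilities>
</xbrli:xbrl>"""

root = ET.fromstring(SAMPLE)

# "{*}Assets" matches direct children named Assets in any namespace.
items = root.findall("{*}Assets")
values = [el.text for el in items]

# Further filtering is done by hand on the returned list.
current = [el for el in items if el.get("contextRef") == "c1"]
print(values, [el.text for el in current])
```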