python xml elementtree

Poorly documented SOAP XML endpoint - overview of elements and attributes


I am new to this and am working (in Python) with an XML list that spans multiple pages. I have no prior knowledge of the XML structure: I don't know which elements, attributes, or nested elements are present. The API is poorly documented, and the little documentation that exists does not say which elements and attributes an instance might have.

For example, I don't even know whether there is an ISBN at all. I did not find one in the first 100 pages of the list, but that does not mean that none of the entities has an ISBN. So I need to check whether any book on those 1000+ pages has an 'ISBN' element. If one does, I'll extend the script to fetch that element/value when present; if none does, I won't bother. Second, since I don't know how the XML is structured, I don't know whether the ISBN would be an attribute, as below:

<book isbn="9780747532699">
    <title>Harry Potter and the Philosopher's Stone</title>
    <author>J.K. Rowling</author>
    <publicationYear>1997</publicationYear>
    <genre>Fantasy</genre>
    <publisher>Bloomsbury Publishing</publisher>
    <language>English</language>
    <price>19.99</price>
</book>

rather than an element as follows:

<book>
    <isbn>9780747532699</isbn>
    <title>Harry Potter and the Philosopher's Stone</title>
    <author>J.K. Rowling</author>
    <publicationYear>1997</publicationYear>
    <genre>Fantasy</genre>
    <publisher>Bloomsbury Publishing</publisher>
    <language>English</language>
    <price>19.99</price>
</book>

This applies to all elements, some of which may not be present at all.

Additionally, some elements will most likely be nested. In the case of multilingual abstracts, I noticed they are indeed nested in a container element 'abstracts'. See below:

<Collection>
    <poetry>
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
        <publicationYear>1925</publicationYear>
    </poetry>
    <book>
        <title>Pride and Prejudice</title>
        <author>Jane Austen</author>
        <publicationYear>1813</publicationYear>
    </book>
    <novel>
        <title>1984</title>
        <author>George Orwell</author>
        <publicationYear>1949</publicationYear>
        <abstracts>
            <abstract it="Italian">Un romanzo distopico che descrive un futuro dominato da un regime autoritario.</abstract>
        </abstracts>
    </novel>
    <journal>
        <title>Some journal</title>
        <editor>George Handsome</editor>
        <publicationYear>1949</publicationYear>
    </journal>
</Collection>

So I need to find out which attributes and elements occur in the list of XML pages, and where. I was hoping there would be a way to:

  1. query the endpoint,
  2. iterate over the 1000+ pages,
  3. build some kind of nested structure of elements, children, and attributes, so I can clearly see what can be found (if present) and how it is stored,
  4. write the scripts to fetch the actual elements/attributes based on the deduced structure.

Questions:

  1. Should I keep exploring BeautifulSoup and ElementTree? Any pointers if so?
  2. What other solutions/recommendations are there to explore/understand the XML, its elements and the attributes?
  3. Am I delusional and barking up the wrong tree?

Solution

  • One option is to process all of the elements and attributes in the XML and capture the unique XPath of each one. Getting the XPath of an element is easy using lxml.

    This will basically map out the entire structure for you, showing not only the element and attribute names but also where they appear in the tree.

    Example:

    from pprint import pprint
    from lxml import etree
    
    sample_xml1 = """
    <book isbn="9780747532699">
        <title>Harry Potter and the Philosopher's Stone</title>
        <author>J.K. Rowling</author>
        <publicationYear>1997</publicationYear>
        <genre>Fantasy</genre>
        <publisher>Bloomsbury Publishing</publisher>
        <language>English</language>
        <price>19.99</price>
    </book>
    """
    
    sample_xml2 = """
    <book>
        <isbn>9780747532699</isbn>
        <title>Harry Potter and the Philosopher's Stone</title>
        <author>J.K. Rowling</author>
        <publicationYear>1997</publicationYear>
        <genre>Fantasy</genre>
        <publisher>Bloomsbury Publishing</publisher>
        <language>English</language>
        <price>19.99</price>
    </book>
    """
    
    sample_xml = [sample_xml1, sample_xml2]
    
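    # Unique XPaths of every element and attribute seen across all documents.
    analysis_results = set()
    
    for xml in sample_xml:
        # Wrap the root in an ElementTree so that getpath() is available.
        tree = etree.ElementTree(etree.fromstring(xml))
    
        # "//*" selects every element in the document.
        for elem in tree.xpath("//*"):
            xpath = tree.getpath(elem)
            analysis_results.add(xpath)
            # Record each attribute as <element path>/@<attribute name>.
            for attr in elem.attrib:
                analysis_results.add(f"{xpath}/@{attr}")
    
    pprint(analysis_results)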
    

    Printed output:

    {'/book',
     '/book/@isbn',
     '/book/author',
     '/book/genre',
     '/book/isbn',
     '/book/language',
     '/book/price',
     '/book/publicationYear',
     '/book/publisher',
     '/book/title'}
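
One caveat when running this over real pages: when several siblings share a tag name (for example many `novel` elements in one `Collection`), `getpath()` adds a positional index such as `novel[2]`, so the set can fill up with near-duplicate paths. A minimal sketch of one way around that is below. It strips the positional indexes and counts how many documents contain each path, which also shows how rare something like an ISBN is. The `pages` list, the `lang` attribute, and the helper names `normalize` and `summarize` are made up for illustration; in practice the strings would come from the endpoint's responses.

```python
import re
from collections import Counter

from lxml import etree

# Hypothetical pages standing in for the real responses.
pages = [
    """<Collection>
         <novel>
           <title>1984</title>
           <abstracts>
             <abstract lang="it">Un romanzo distopico.</abstract>
             <abstract lang="en">A dystopian novel.</abstract>
           </abstracts>
         </novel>
         <novel><title>Emma</title></novel>
       </Collection>""",
    """<Collection>
         <book isbn="9780747532699"><title>Harry Potter</title></book>
       </Collection>""",
]


def normalize(path):
    """Drop positional indexes such as [2] so repeated siblings share one path."""
    return re.sub(r"\[\d+\]", "", path)


def summarize(xml_documents):
    """Count, per normalized path, how many documents contain it."""
    counts = Counter()
    for xml in xml_documents:
        tree = etree.ElementTree(etree.fromstring(xml))
        seen = set()  # paths found in this document only
        for elem in tree.iter("*"):  # "*" skips comments and processing instructions
            path = normalize(tree.getpath(elem))
            seen.add(path)
            for attr in elem.attrib:
                seen.add(f"{path}/@{attr}")
        counts.update(seen)  # each path counts once per document
    return counts


summary = summarize(pages)
for path, n in sorted(summary.items()):
    print(f"{n:>3}  {path}")
```

Here `/Collection` is counted in both documents, while `/Collection/book/@isbn` shows up in only one. A path with a low count is a field that exists but is rare, which is exactly the case that is easy to miss when looking only at the first pages.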