Tags: python, pandas, xml, xml-parsing, elementtree

How do I extract the value of "xml:id" from XML into a dataframe using ElementTree in Python?


I'm currently wrangling bibliographic information from an XML structure into, literally, almost anything else usable. My final move is to pull the value of the "xml:id" attribute and add it to my nice dataframe. Everything else is working nicely with ElementTree and pandas in Python.

For example, I want to pull "Kagawa2014" from the biblStruct below:

<biblStruct type="book" xml:id="Kagawa2014" corresp="http://zotero.org/users/local/fmahZILk/items/EAK64XAU">
    <monogr>
#blahblah
    </monogr>
</biblStruct>

I've tried a few things I found on Stack Overflow:

for biblStruct in root.findall('.//tei:biblStruct', namespace):
    id_elem = biblStruct.attrib('xml:id')

That raised TypeError: 'dict' object is not callable. Then I tried this, which I had a lot of hope for:

for biblStruct in root.findall('.//tei:biblStruct', namespace):
    id_elem = biblStruct.get('{http://w3.org/XML/1998/namespace}id')
    id_text = id_elem.text if id_elem is not None else ''
    xmlID.append(id_text)

    
data = {
    'XML_ID':xmlID
    }
df = pd.DataFrame(data)
print(df)

This returned a dataframe with one row per biblStruct (the correct number), but it showed only the row index (0, 1, 2, 3, 4, etc.) and no IDs. I also tried:

for biblStruct in root.findall('.//tei:biblStruct', namespace):
    id_elem = biblStruct.get('{http://w3.org/XML/1998/namespace}id')
    xmlID.append(id_elem)

    
data_again = {
    'XML_ID': xmlID
    }
df_again = pd.DataFrame(data_again)
print(df_again)

This returned a dataframe like the one above, only now with TWICE as many rows! Like magic.


Solution

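  • I managed to make it work. To extract the value of xml:id from the snippet:

        xml_id = biblStruct.get('{http://www.w3.org/XML/1998/namespace}id', '')

    My first attempts were missing the '' at the end. They also used http://w3.org/XML/1998/namespace (without the www.) as the namespace URI. The key has to match http://www.w3.org/XML/1998/namespace exactly, otherwise get() finds nothing; the '' default only replaces the resulting None with an empty string.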
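For anyone who wants to try the whole thing end to end, here is a minimal sketch. The sample XML, the TEI namespace mapping, and the names TEI_SAMPLE, XML_ID and xml_ids are assumptions made for illustration; the question does not show the real file or the contents of its namespace dict. Note that get() returns the attribute value as a plain string, so there is no .text to read from it, and that the list is created fresh right before the loop so that re-running the code does not append the same IDs a second time.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample input modelled on the snippet in the question.
TEI_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <listBibl>
    <biblStruct type="book" xml:id="Kagawa2014">
      <monogr/>
    </biblStruct>
    <biblStruct type="book">
      <monogr/>
    </biblStruct>
  </listBibl>
</TEI>"""

# Assumed prefix mapping; the question's own `namespace` dict is not shown.
namespace = {"tei": "http://www.tei-c.org/ns/1.0"}

# ElementTree stores xml:id under the full URI of the reserved "xml" prefix.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

root = ET.fromstring(TEI_SAMPLE)

# Start from an empty list each run so old results are not carried over.
xml_ids = []
for bibl_struct in root.findall(".//tei:biblStruct", namespace):
    # get() returns the attribute value as a string, or "" if it is absent.
    xml_ids.append(bibl_struct.get(XML_ID, ""))

df = pd.DataFrame({"XML_ID": xml_ids})
print(df)
```

The second biblStruct in the sample deliberately has no xml:id, to show the '' default producing an empty string instead of None in the column.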