I'm currently wrangling bibliographic information from an XML structure into, literally, almost anything else usable. My final step is to pull the value of the "xml:id" attribute and add it to my dataframe. Everything else is working nicely with ElementTree and pandas in Python.
E.g., I want to pull "Kagawa2014" from the biblStruct below:
<biblStruct type="book" xml:id="Kagawa2014" corresp="http://zotero.org/users/local/fmahZILk/items/EAK64XAU">
<monogr>
#blahblah
</monogr>
</biblStruct>
I've tried a few things I've found on Stack Overflow:
for biblStruct in root.findall('.//tei:biblStruct', namespace):
    id_elem = biblStruct.attrib('xml:id')
and received TypeError: 'dict' object is not callable. Then I tried this, which I had a lot of hope for:
for biblStruct in root.findall('.//tei:biblStruct', namespace):
    id_elem = biblStruct.get('{http://w3.org/XML/1998/namespace}id')
    id_text = id_elem.text if id_elem is not None else ''
    xmlID.append(id_text)
data = {
    'XML_ID': xmlID
}
df = pd.DataFrame(data)
print(df)
This returned a DF with the correct number of rows (one per biblStruct), but it showed only the index (i.e. 0, 1, 2, 3, 4, etc.) and no ID values. I also tried:
for biblStruct in root.findall('.//tei:biblStruct', namespace):
    id_elem = biblStruct.get('{http://w3.org/XML/1998/namespace}id')
    xmlID.append(id_elem)
data_again = {
    'XML_ID': xmlID
}
df_again = pd.DataFrame(data_again)
print(df_again)
This returned a DF like the one above, only now with TWICE as many rows! Like magic.
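A minimal sketch of what the two failed attempts were running into, using only the standard library. The wrapping `listBibl` element and the TEI namespace URI are assumptions, since the post does not show the root element or the contents of `namespace`. It prints the attribute keys ElementTree actually stores and compares a lookup with and without `www.` in the XML namespace URI.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the snippet from the post. The <listBibl> wrapper and the
# TEI default namespace are assumptions made so the sample parses on its own.
SAMPLE = """<listBibl xmlns="http://www.tei-c.org/ns/1.0">
  <biblStruct type="book" xml:id="Kagawa2014">
    <monogr/>
  </biblStruct>
</listBibl>"""

namespace = {"tei": "http://www.tei-c.org/ns/1.0"}  # assumed mapping
root = ET.fromstring(SAMPLE)
bibl = root.find(".//tei:biblStruct", namespace)

# .attrib is a plain dict: index it or iterate it, but do not call it.
keys = sorted(bibl.attrib)
print(keys)

# ElementTree stores xml:id under the full XML namespace URI.
# Without "www." the key does not match and .get() returns None.
wrong = bibl.get("{http://w3.org/XML/1998/namespace}id")
right = bibl.get("{http://www.w3.org/XML/1998/namespace}id")
print(wrong, right)
```

Since `.get()` already returns the attribute value as a string (or `None`), there is no `.text` to read from it. A list that is appended to in two successive loops without being reset would also explain the doubled row count.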
I managed to make it work. To extract the value of xml:id from the snippet:
xml_id = biblStruct.get('{http://www.w3.org/XML/1998/namespace}id', '')
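My first attempts were missing the '' default at the end. They also used http://w3.org/XML/1998/namespace without the www., so .get() never matched the attribute and returned None for every biblStruct.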
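For completeness, here is a hedged sketch of the whole loop around that line. The sample document, the TEI namespace mapping, and the helper name `collect_xml_ids` are illustrative assumptions, not taken from the real data. It builds a fresh list on every call, so re-running it cannot stack new results on top of old ones.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"  # fixed URI of the xml: prefix
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}  # assumed mapping

# Hypothetical two-entry sample; the second entry has no xml:id on purpose.
SAMPLE = """<listBibl xmlns="http://www.tei-c.org/ns/1.0">
  <biblStruct type="book" xml:id="Kagawa2014"><monogr/></biblStruct>
  <biblStruct type="book"><monogr/></biblStruct>
</listBibl>"""


def collect_xml_ids(root):
    """Return the xml:id of every biblStruct, '' where it is absent."""
    ids = []  # fresh list per call, so repeated runs do not accumulate
    for bibl in root.findall(".//tei:biblStruct", TEI_NS):
        # Builds the key '{http://www.w3.org/XML/1998/namespace}id'
        ids.append(bibl.get(f"{{{XML_NS}}}id", ""))
    return ids


root = ET.fromstring(SAMPLE)
data = {"XML_ID": collect_xml_ids(root)}
print(data)
```

The resulting `data` dict has the same shape as the one passed to `pd.DataFrame` earlier in the post, so it should drop into the existing pandas step unchanged.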