
Extract data from ORCID XML files using Python

I am trying to parse names offline from ORCID XML files using Python. The files are downloaded from :

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<record:record xmlns:internal="" xmlns:address="" xmlns:email="" xmlns:history="" xmlns:employment="" xmlns:person="" xmlns:education="" xmlns:other-name="" xmlns:personal-details="" xmlns:bulk="" xmlns:common="" xmlns:record="" xmlns:keyword="" xmlns:activities="" xmlns:deprecated="" xmlns:external-identifier="" xmlns:funding="" xmlns:error="" xmlns:preferences="" xmlns:work="" xmlns:researcher-url="" xmlns:peer-review="" path="/0000-0001-5006-8001">

     <person:person path="/0000-0001-5006-8001/person">
    <person:name visibility="public" path="0000-0001-5006-8001">

What I want is to extract given-names and family-name: Marjorie Biffi. I am trying to use this code:

>>> import xml.etree.ElementTree as ET
>>> root = ET.parse('f.xml').getroot()
>>> p=root.findall('{}personal-details')
>>> p

I can't figure out how to extract the given names and surname from this XML file. I have also tried using XPath/Selector, but with no success.


  • This will get you the results you want, but by stepping down through each level in turn.

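    # The namespace URI of each element goes inside the braces: '{uri}tag'.
    p1 = root.find('{}person')
    name = p1.find('{}name')
    given_names = name.find('{}given-names')
    family_name = name.find('{}family-name')
    # print() already separates its arguments with a single space.
    print(given_names.text, family_name.text)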

    You could also go directly to that sublevel, wherever it sits in the tree, with .//

    family_name = root.find('.//{}family-name')

    I have also posted about simpler ways to parse XML if you're doing more basic operations. These include xmltodict (which converts the document to an OrderedDict) and untangle, which is a little inefficient but very quick and easy to learn.
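
If you would rather not spell out the namespace URIs at all, one option is to match on the local part of each tag. The following is only a minimal sketch: the sample document is a cut-down stand-in modelled on the snippet in the question, its `http://example.org/...` namespace URIs are placeholders rather than the real ORCID ones, and the helpers `local_name`, `find_text` and `extract_name` are made-up names for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the namespace URIs are placeholders, not the real ORCID ones.
SAMPLE = """
<record:record xmlns:record="http://example.org/ns/record"
               xmlns:person="http://example.org/ns/person"
               xmlns:personal-details="http://example.org/ns/personal-details"
               path="/0000-0001-5006-8001">
  <person:person path="/0000-0001-5006-8001/person">
    <person:name visibility="public" path="0000-0001-5006-8001">
      <personal-details:given-names>Marjorie</personal-details:given-names>
      <personal-details:family-name>Biffi</personal-details:family-name>
    </person:name>
  </person:person>
</record:record>
"""


def local_name(tag):
    """Strip the '{namespace-uri}' prefix that ElementTree puts on tags."""
    return tag.rsplit('}', 1)[-1]


def find_text(root, name):
    """Return the text of the first element whose local name matches, else None."""
    for elem in root.iter():
        if local_name(elem.tag) == name:
            return elem.text
    return None


def extract_name(root):
    """Return (given names, family name) found anywhere below root."""
    return find_text(root, 'given-names'), find_text(root, 'family-name')


# For a file on disk, use ET.parse('f.xml').getroot() instead.
root = ET.fromstring(SAMPLE.strip())
given, family = extract_name(root)
print(given, family)
```

This ignores namespaces entirely, so it would also match a same-named element from an unrelated namespace; for these files that is probably acceptable. On Python 3.8 and later, ElementTree also accepts a `{*}` wildcard namespace in paths, so `root.find('.//{*}family-name')` should do the same job in one call.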