I have multiple XML files from PubMed. Several files are here.
How can I parse them and get these columns into a single dataframe? If an article has several authors, I want each author on a separate row.
Expected output (all authors should be included):
Title   Year  ArticleTitle      LastName  ForeName
Nature  2021  Inter-mosaic ...  Roy       Suva
Nature  2021  Inter-mosaic ...  Pearson   John
Nature  2021  Neural dynamics   Pearson   John
Nature  2021  Neural dynamics   Mooney    Richard
First, what you want is doable. Something like this should work for your second file, and you could handle the other files by wrapping the code in a for loop:
from lxml import etree
import pandas as pd

doc = etree.parse('file.xml')
# The first three names are also the XML tags queried for the article-level fields
columns = ['Title','ArticleDate','ArticleTitle','LastName','ForeName']

# Article-level fields: take the first match of each tag in the document
title = doc.xpath(f'//{columns[0]}/text()')[0]
year = doc.xpath(f'//{columns[1]}//Year/text()')[0]
article_title = doc.xpath(f'//{columns[2]}/text()')[0]

# One row per author, repeating the article-level fields
rows = []
for auth in doc.xpath('//Author'):
    last_name = auth.xpath(f'{columns[3]}/text()')[0]
    fore_name = auth.xpath(f'{columns[4]}/text()')[0]
    rows.append([title,year,article_title,last_name,fore_name])

pd.DataFrame(rows,columns=columns)
Output (for 34671166.xml):
   Title   ArticleDate  ArticleTitle                                       LastName        ForeName
0  Nature  2021         Neural dynamics underlying birdsong practice a...  Singh Alvarado  Jonnathan
1  Nature  2021         Neural dynamics underlying birdsong practice a...  Goffinet        Jack
2  Nature  2021         Neural dynamics underlying birdsong practice a...  Michael         Valerie
3  Nature  2021         Neural dynamics underlying birdsong practice a...  Liberti         William
4  Nature  2021         Neural dynamics underlying birdsong practice a...  Hatfield        Jordan
5  Nature  2021         Neural dynamics underlying birdsong practice a...  Gardner         Timothy
6  Nature  2021         Neural dynamics underlying birdsong practice a...  Pearson         John
7  Nature  2021         Neural dynamics underlying birdsong practice a...  Mooney          Richard
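To cover several files, the snippet above can be wrapped in a function and called in a loop. This is only a sketch: the names `rows_from_file`, `frame_from_files` and `COLUMNS` are made up for illustration, and it assumes each file follows the usual PubMed layout with one or more `PubmedArticle` elements. It uses `findtext` instead of `xpath(...)[0]`, so an author without a `LastName` or `ForeName` (a collective author, for instance) gives an empty cell rather than an `IndexError`.

```python
import pandas as pd
from lxml import etree

# Column names taken from the expected output in the question
COLUMNS = ["Title", "Year", "ArticleTitle", "LastName", "ForeName"]


def rows_from_file(path):
    """Return one [Title, Year, ArticleTitle, LastName, ForeName] row per author."""
    doc = etree.parse(str(path))
    rows = []
    # Assumption: a file may hold one or many <PubmedArticle> records
    for article in doc.xpath("//PubmedArticle"):
        title = article.findtext(".//Journal/Title")
        year = article.findtext(".//ArticleDate/Year")
        # ArticleTitle may contain inline markup, so join all nested text
        title_el = article.find(".//ArticleTitle")
        article_title = "".join(title_el.itertext()) if title_el is not None else None
        for auth in article.xpath(".//AuthorList/Author"):
            rows.append([
                title,
                year,
                article_title,
                auth.findtext("LastName"),   # None if the tag is missing
                auth.findtext("ForeName"),   # None if the tag is missing
            ])
    return rows


def frame_from_files(paths):
    """Parse every file in `paths` and stack the author rows in one dataframe."""
    rows = []
    for path in paths:
        rows.extend(rows_from_file(path))
    return pd.DataFrame(rows, columns=COLUMNS)
```

You would then call something like `frame_from_files(sorted(Path('.').glob('*.xml')))` after `from pathlib import Path`, adjusting the pattern to wherever your files live.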
Having said all that, I'm not sure a dataframe with each author on a separate row is the best idea for the type of data you have. In this example, since you have 8 co-authors, information such as the article title is repeated unnecessarily 8 times. You could give each author a separate set of columns instead, but then you run into problems when articles have different numbers of co-authors, say 3 or 10...
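One possible middle ground, offered only as a sketch, is to keep one row per article with the authors stored as a list in a single cell, and expand to one row per author only when an analysis needs it. The `Authors` column and the `articles` / `per_author` names are my own invention, and the author list is cut down to two names to keep the example short.

```python
import pandas as pd

# One row per article; the author list lives in a single cell.
# Values are copied from the output above, with the author list shortened.
articles = pd.DataFrame(
    {
        "Title": ["Nature"],
        "Year": ["2021"],
        "ArticleTitle": ["Neural dynamics underlying birdsong practice a..."],
        "Authors": [["Pearson John", "Mooney Richard"]],
    }
)

# Expand to one row per author on demand; the other columns are repeated
per_author = articles.explode("Authors", ignore_index=True)
```

This way the article title is stored once, and `explode` still gives you the long format from your question whenever you want it.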