I have a series of a large XML files that I would like to import into a pandas DataFrame. All the XML data files have the same structure. Here is one of those files as a sample:
<clinical_study>
<!-- This xml conforms to an XML Schema at:
https://clinicaltrials.gov/ct2/html/images/info/public.xsd -->
<required_header>
<download_date>ClinicalTrials.gov processed this data on September 19, 2022</download_date>
<link_text>Link to the current ClinicalTrials.gov record.</link_text>
<url>https://clinicaltrials.gov/show/NCT00000102</url>
</required_header>
<id_info>
<org_study_id>NCRR-M01RR01070-0506</org_study_id>
<secondary_id>M01RR001070</secondary_id>
<nct_id>NCT00000102</nct_id>
</id_info>
<brief_title>Congenital Adrenal Hyperplasia: Calcium Channels as Therapeutic Targets</brief_title>
<sponsors>
<lead_sponsor>
<agency>National Center for Research Resources (NCRR)</agency>
<agency_class>NIH</agency_class>
</lead_sponsor>
</sponsors>
<source>National Center for Research Resources (NCRR)</source>
<brief_summary>
<textblock>
This study will test the ability of extended release nifedipine (Procardia XL), a blood
pressure medication, to permit a decrease in the dose of glucocorticoid medication children
take to treat congenital adrenal hyperplasia (CAH).
</textblock>
</brief_summary>
<detailed_description>
<textblock>
This protocol is designed to assess both acute and chronic effects of the calcium channel
antagonist, nifedipine, on the hypothalamic-pituitary-adrenal axis in patients with
congenital adrenal hyperplasia. The multicenter trial is composed of two phases and will
involve a double-blind, placebo-controlled parallel design. The goal of Phase I is to examine
the ability of nifedipine vs. placebo to decrease adrenocorticotropic hormone (ACTH) levels,
as well as to begin to assess the dose-dependency of nifedipine effects. The goal of Phase II
is to evaluate the long-term effects of nifedipine; that is, can attenuation of ACTH release
by nifedipine permit a decrease in the dosage of glucocorticoid needed to suppress the HPA
axis? Such a decrease would, in turn, reduce the deleterious effects of glucocorticoid
treatment in CAH.
</textblock>
</detailed_description>
<overall_status>Completed</overall_status>
<phase>Phase 1/Phase 2</phase>
<study_type>Interventional</study_type>
<has_expanded_access>No</has_expanded_access>
<study_design_info>
<intervention_model>Parallel Assignment</intervention_model>
<primary_purpose>Treatment</primary_purpose>
<masking>Double</masking>
</study_design_info>
<condition>Congenital Adrenal Hyperplasia</condition>
<intervention>
<intervention_type>Drug</intervention_type>
<intervention_name>Nifedipine</intervention_name>
</intervention>
<eligibility>
<criteria>
<textblock>
Inclusion Criteria:

- diagnosed with Congenital Adrenal Hyperplasia (CAH)

- normal ECG during baseline evaluation

Exclusion Criteria:

- history of liver disease, or elevated liver function tests

- history of cardiovascular disease
</textblock>
</criteria>
<gender>All</gender>
<minimum_age>14 Years</minimum_age>
<maximum_age>35 Years</maximum_age>
<healthy_volunteers>No</healthy_volunteers>
</eligibility>
<location>
<facility>
<name>Medical University of South Carolina</name>
<address>
<city>Charleston</city>
<state>South Carolina</state>
<country>United States</country>
</address>
</facility>
</location>
<location_countries>
<country>United States</country>
</location_countries>
<verification_date>January 2004</verification_date>
<study_first_submitted>November 3, 1999</study_first_submitted>
<study_first_submitted_qc>November 3, 1999</study_first_submitted_qc>
<study_first_posted type="Estimate">November 4, 1999</study_first_posted>
<last_update_submitted>June 23, 2005</last_update_submitted>
<last_update_submitted_qc>June 23, 2005</last_update_submitted_qc>
<last_update_posted type="Estimate">June 24, 2005</last_update_posted>
<condition_browse>
<!-- CAUTION: The following MeSH terms are assigned with an imperfect algorithm -->
<mesh_term>Adrenal Hyperplasia, Congenital</mesh_term>
<mesh_term>Adrenogenital Syndrome</mesh_term>
<mesh_term>Adrenocortical Hyperfunction</mesh_term>
<mesh_term>Hyperplasia</mesh_term>
</condition_browse>
<intervention_browse>
<!-- CAUTION: The following MeSH terms are assigned with an imperfect algorithm -->
<mesh_term>Nifedipine</mesh_term>
</intervention_browse>
<!-- Results have not yet been posted for this study -->
</clinical_study>
I am using the following code to pull the data into a pandas.DataFrame object:
import pandas as pd
file = 'NCT00000102.xml'
df = pd.read_xml(file)
However, when I print out the 'df' object, I get the following output:
download_date ... sharing_ipd
0 ClinicalTrials.gov processed this data on Sept... ... None
1 None ... None
2 None ... None
3 None ... None
4 None ... None
5 None ... None
6 None ... None
7 None ... None
8 None ... None
9 None ... None
10 None ... None
11 None ... None
12 None ... None
13 None ... None
14 None ... None
15 None ... None
16 None ... None
17 None ... None
18 None ... None
19 None ... None
20 None ... None
21 None ... None
22 None ... None
23 None ... None
24 None ... None
25 None ... None
26 None ... None
27 None ... None
28 None ... None
29 None ... None
30 None ... None
31 None ... None
32 None ... None
33 None ... None
34 None ... None
35 None ... None
36 None ... None
37 None ... None
38 None ... None
39 None ... None
40 None ... None
41 None ... None
42 None ... None
43 None ... None
44 None ... None
45 None ... None
46 None ... None
47 None ... None
48 None ... None
49 None ... None
50 None ... None
51 None ... None
52 None ... None
53 None ... None
54 None ... None
55 None ... None
56 None ... None
57 None ... No
[58 rows x 67 columns]
Ideally, the output should be a DataFrame that would print to something like this:
required_header.download_date | required_header.link_text | required_header.url | id_info.org_study_id | ... | |
---|---|---|---|---|---|
0 | ClinicalTrials.gov processed this data on September 19, 2022 | Link to the current ClinicalTrials.gov record. | https://clinicaltrials.gov/show/NCT00000102 | NCRR-M01RR01070-0506 | ... |
I exported the DataFrame into a csv to examine in Excel, and the result is essentially a diagonal matrix; i.e. in the first row, all the values are None
except for the first column; In the second row, all the values are None
except for the second colunm; and so forth.
I looked into the advice offered through various other posts such as 73218447. Although it lead me to more research on lxml and xpath and helped me learn a lot, it didn't show me how to import the entire XML data into a single row of a DataFrame object.
I think I'm supposed to spell out the data structure in the xpath
parameter of the read_xml
function, but I'm quite new to the use of lxml
and etree
and couldn't find a way of automating the generation of xpath
input even looping through the nodes from etree.parse(file)
.
Any advice is greatly appreciated.
Here is one to reshape your "diagonal" dataframe (26 rows x 36 columns) to a flat one of 2 rows x 33 columns, since some columns, like mesh_term
, have two values (rows) instead of one, and three columns are completely empty (thus removed):
df = pd.read_xml("NCT00000102.xml").dropna(how="all", axis=1)
df = pd.concat(
[df[col].dropna(how="any").reset_index(drop=True) for col in df.columns], axis=1
).fillna(method="ffill")
# Output
download_date link_text url org_study_id \
0 ClinicalTri... Link to the... https://cli... NCRR-M01RR0...
1 ClinicalTri... Link to the... https://cli... NCRR-M01RR0...
secondary_id ... study_first_posted last_update_submitted \
0 M01RR001070 ... November 4,... June 23, 2005
1 M01RR001070 ... November 4,... June 23, 2005
last_update_submitted_qc last_update_posted mesh_term
0 June 23, 2005 June 24, 2005 Hyperplasia
1 June 23, 2005 June 24, 2005 Nifedipine
[2 rows x 33 columns]