Search code examples
python-3.xpandasxmldataframediagonal

How to flatten a diagonal dataframe after reading an XML file?


I have a series of a large XML files that I would like to import into a pandas DataFrame. All the XML data files have the same structure. Here is one of those files as a sample:

<clinical_study>
  <!-- This xml conforms to an XML Schema at:
    https://clinicaltrials.gov/ct2/html/images/info/public.xsd -->
  <required_header>
    <download_date>ClinicalTrials.gov processed this data on September 19, 2022</download_date>
    <link_text>Link to the current ClinicalTrials.gov record.</link_text>
    <url>https://clinicaltrials.gov/show/NCT00000102</url>
  </required_header>
  <id_info>
    <org_study_id>NCRR-M01RR01070-0506</org_study_id>
    <secondary_id>M01RR001070</secondary_id>
    <nct_id>NCT00000102</nct_id>
  </id_info>
  <brief_title>Congenital Adrenal Hyperplasia: Calcium Channels as Therapeutic Targets</brief_title>
  <sponsors>
    <lead_sponsor>
      <agency>National Center for Research Resources (NCRR)</agency>
      <agency_class>NIH</agency_class>
    </lead_sponsor>
  </sponsors>
  <source>National Center for Research Resources (NCRR)</source>
  <brief_summary>
    <textblock>
      This study will test the ability of extended release nifedipine (Procardia XL), a blood&#xD;
      pressure medication, to permit a decrease in the dose of glucocorticoid medication children&#xD;
      take to treat congenital adrenal hyperplasia (CAH).&#xD;
    </textblock>
  </brief_summary>
  <detailed_description>
    <textblock>
      This protocol is designed to assess both acute and chronic effects of the calcium channel&#xD;
      antagonist, nifedipine, on the hypothalamic-pituitary-adrenal axis in patients with&#xD;
      congenital adrenal hyperplasia. The multicenter trial is composed of two phases and will&#xD;
      involve a double-blind, placebo-controlled parallel design. The goal of Phase I is to examine&#xD;
      the ability of nifedipine vs. placebo to decrease adrenocorticotropic hormone (ACTH) levels,&#xD;
      as well as to begin to assess the dose-dependency of nifedipine effects. The goal of Phase II&#xD;
      is to evaluate the long-term effects of nifedipine; that is, can attenuation of ACTH release&#xD;
      by nifedipine permit a decrease in the dosage of glucocorticoid needed to suppress the HPA&#xD;
      axis? Such a decrease would, in turn, reduce the deleterious effects of glucocorticoid&#xD;
      treatment in CAH.&#xD;
    </textblock>
  </detailed_description>
  <overall_status>Completed</overall_status>
  <phase>Phase 1/Phase 2</phase>
  <study_type>Interventional</study_type>
  <has_expanded_access>No</has_expanded_access>
  <study_design_info>
    <intervention_model>Parallel Assignment</intervention_model>
    <primary_purpose>Treatment</primary_purpose>
    <masking>Double</masking>
  </study_design_info>
  <condition>Congenital Adrenal Hyperplasia</condition>
  <intervention>
    <intervention_type>Drug</intervention_type>
    <intervention_name>Nifedipine</intervention_name>
  </intervention>
  <eligibility>
    <criteria>
      <textblock>
        Inclusion Criteria:&#xD;
&#xD;
          -  diagnosed with Congenital Adrenal Hyperplasia (CAH)&#xD;
&#xD;
          -  normal ECG during baseline evaluation&#xD;
&#xD;
        Exclusion Criteria:&#xD;
&#xD;
          -  history of liver disease, or elevated liver function tests&#xD;
&#xD;
          -  history of cardiovascular disease&#xD;
      </textblock>
    </criteria>
    <gender>All</gender>
    <minimum_age>14 Years</minimum_age>
    <maximum_age>35 Years</maximum_age>
    <healthy_volunteers>No</healthy_volunteers>
  </eligibility>
  <location>
    <facility>
      <name>Medical University of South Carolina</name>
      <address>
        <city>Charleston</city>
        <state>South Carolina</state>
        <country>United States</country>
      </address>
    </facility>
  </location>
  <location_countries>
    <country>United States</country>
  </location_countries>
  <verification_date>January 2004</verification_date>
  <study_first_submitted>November 3, 1999</study_first_submitted>
  <study_first_submitted_qc>November 3, 1999</study_first_submitted_qc>
  <study_first_posted type="Estimate">November 4, 1999</study_first_posted>
  <last_update_submitted>June 23, 2005</last_update_submitted>
  <last_update_submitted_qc>June 23, 2005</last_update_submitted_qc>
  <last_update_posted type="Estimate">June 24, 2005</last_update_posted>
  <condition_browse>
    <!-- CAUTION:  The following MeSH terms are assigned with an imperfect algorithm            -->
    <mesh_term>Adrenal Hyperplasia, Congenital</mesh_term>
    <mesh_term>Adrenogenital Syndrome</mesh_term>
    <mesh_term>Adrenocortical Hyperfunction</mesh_term>
    <mesh_term>Hyperplasia</mesh_term>
  </condition_browse>
  <intervention_browse>
    <!-- CAUTION:  The following MeSH terms are assigned with an imperfect algorithm            -->
    <mesh_term>Nifedipine</mesh_term>
  </intervention_browse>
  <!-- Results have not yet been posted for this study                                          -->
</clinical_study>

I am using the following code to pull the data into a pandas.DataFrame object:

import pandas as pd
file = 'NCT00000102.xml'
df = pd.read_xml(file)

However, when I print out the 'df' object, I get the following output:

                                        download_date  ... sharing_ipd
0   ClinicalTrials.gov processed this data on Sept...  ...        None
1                                                None  ...        None
2                                                None  ...        None
3                                                None  ...        None
4                                                None  ...        None
5                                                None  ...        None
6                                                None  ...        None
7                                                None  ...        None
8                                                None  ...        None
9                                                None  ...        None
10                                               None  ...        None
11                                               None  ...        None
12                                               None  ...        None
13                                               None  ...        None
14                                               None  ...        None
15                                               None  ...        None
16                                               None  ...        None
17                                               None  ...        None
18                                               None  ...        None
19                                               None  ...        None
20                                               None  ...        None
21                                               None  ...        None
22                                               None  ...        None
23                                               None  ...        None
24                                               None  ...        None
25                                               None  ...        None
26                                               None  ...        None
27                                               None  ...        None
28                                               None  ...        None
29                                               None  ...        None
30                                               None  ...        None
31                                               None  ...        None
32                                               None  ...        None
33                                               None  ...        None
34                                               None  ...        None
35                                               None  ...        None
36                                               None  ...        None
37                                               None  ...        None
38                                               None  ...        None
39                                               None  ...        None
40                                               None  ...        None
41                                               None  ...        None
42                                               None  ...        None
43                                               None  ...        None
44                                               None  ...        None
45                                               None  ...        None
46                                               None  ...        None
47                                               None  ...        None
48                                               None  ...        None
49                                               None  ...        None
50                                               None  ...        None
51                                               None  ...        None
52                                               None  ...        None
53                                               None  ...        None
54                                               None  ...        None
55                                               None  ...        None
56                                               None  ...        None
57                                               None  ...          No

[58 rows x 67 columns]

Ideally, the output should be a DataFrame that would print to something like this:

required_header.download_date required_header.link_text required_header.url id_info.org_study_id ...
0 ClinicalTrials.gov processed this data on September 19, 2022 Link to the current ClinicalTrials.gov record. https://clinicaltrials.gov/show/NCT00000102 NCRR-M01RR01070-0506 ...

I exported the DataFrame into a csv to examine in Excel, and the result is essentially a diagonal matrix; i.e. in the first row, all the values are None except for the first column; In the second row, all the values are None except for the second colunm; and so forth.

I looked into the advice offered through various other posts such as 73218447. Although it lead me to more research on lxml and xpath and helped me learn a lot, it didn't show me how to import the entire XML data into a single row of a DataFrame object.

I think I'm supposed to spell out the data structure in the xpath parameter of the read_xml function, but I'm quite new to the use of lxml and etree and couldn't find a way of automating the generation of xpath input even looping through the nodes from etree.parse(file).

Any advice is greatly appreciated.


Solution

  • Here is one to reshape your "diagonal" dataframe (26 rows x 36 columns) to a flat one of 2 rows x 33 columns, since some columns, like mesh_term, have two values (rows) instead of one, and three columns are completely empty (thus removed):

    df = pd.read_xml("NCT00000102.xml").dropna(how="all", axis=1)
    
    df = pd.concat(
        [df[col].dropna(how="any").reset_index(drop=True) for col in df.columns], axis=1
    ).fillna(method="ffill")
    
    # Output
        download_date       link_text             url    org_study_id  \
    0  ClinicalTri...  Link to the...  https://cli...  NCRR-M01RR0...   
    1  ClinicalTri...  Link to the...  https://cli...  NCRR-M01RR0...   
    
      secondary_id  ... study_first_posted last_update_submitted  \
    0  M01RR001070  ...  November 4,...      June 23, 2005
    1  M01RR001070  ...  November 4,...      June 23, 2005
    
      last_update_submitted_qc last_update_posted    mesh_term  
    0   June 23, 2005            June 24, 2005     Hyperplasia  
    1   June 23, 2005            June 24, 2005      Nifedipine  
    
    [2 rows x 33 columns]