Search code examples
pythonpandasxmljson-normalize

Flatten XML data as a pandas dataframe


How can I convert this XML file at this address into a pandas dataframe? I have downloaded the XML as a file and called it '058com.xml' and run the code below, though the last column of the resulting dataframe is a mess of data arranged as multiple OrderedDict. The XML structure seems complex and is beyond my knowledge.

json_normalize documentation left me confused. How can I improve the code to fully flatten the XML ?

import pandas as pd
import xmltodict

rawdata = '058com.xml'

with open(rawdata) as fd:
    doc = xmltodict.parse(fd.read(), encoding='ISO-8859-1', process_namespaces=False)

pd.json_normalize(doc['Election']['Departement']['Communes']['Commune'])

Ideally the dataframe should look like ID's, names for geographic entities and vote results and names of election candidates.

The final dataframe should contain a lot of columns when fully flatten and is expected to be very close of the CSV below. I pasted the headers and the first line in the form of a .csv (semi-colon separated) as a resentative sample of what the dataframe should look like

Code du département;Libellé du département;Code de la commune;Libellé de la commune;Etat saisie;Inscrits;Abstentions;% Abs/Ins;Votants;% Vot/Ins;Blancs;% Blancs/Ins;% Blancs/Vot;Nuls;% Nuls/Ins;% Nuls/Vot;Exprimés;% Exp/Ins;% Exp/Vot;N°Panneau;Sexe;Nom;Prénom;Voix;% Voix/Ins;% Voix/Exp
01;Ain;001;L'Abergement-Clémenciat;Complet;645;108;16,74;537;83,26;16;2,48;2,98;1;0,16;0,19;520;80,62;96,83;1;F;ARTHAUD;Nathalie;3;0,47;0,58;2;M;ROUSSEL;Fabien;6;0,93;1,15;3;M;MACRON;Emmanuel;150;23,26;28,85;4;M;LASSALLE;Jean;18;2,79;3,46;5;F;LE PEN;Marine;149;23,10;28,65;6;M;ZEMMOUR;Éric;43;6,67;8,27;7;M;MÉLENCHON;Jean-Luc;66;10,23;12,69;8;F;HIDALGO;Anne;5;0,78;0,96;9;M;JADOT;Yannick;30;4,65;5,77;10;F;PÉCRESSE;Valérie;26;4,03;5,00;11;M;POUTOU;Philippe;3;0,47;0,58;12;M;DUPONT-AIGNAN;Nicolas;21;3,26;4,04

Solution

  • Since the URL really contains two data sections under each <Tour>, specifically <Mentions> (which appear to be aggregate vote data) and <Candidats> (which are granular person-level data) (pardon my French), consider building two separate data frames using the new IO method, pandas.read_xml, which supports XSLT 1.0 (via the third-party lxml package). No migration to dictionaries for JSON handling.

    As a special purpose language written in XML, XSLT can transform your nested structure to flatter format for migration to data frame. Specifically, each stylesheet drills down to the most granular node and then by the ancestor axis pulls higher level information as sibling columns.

    Mentions (save as .xsl, a special .xml file or embed as string in Python)

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output indent="yes"/>
      <xsl:strip-space elements="*"/>
      
      <xsl:template match="/">
        <Tours>
          <xsl:apply-templates select="descendant::Tour/Mentions"/>
        </Tours>
      </xsl:template>
      
      <xsl:template match="Mentions/*">
        <Mention>
          <xsl:copy-of select="ancestor::Election/Scrutin/*"/>
          <xsl:copy-of select="ancestor::Departement/*[name()!='Communes']"/>
          <xsl:copy-of select="ancestor::Commune/*[name()!='Tours']"/>
          <xsl:copy-of select="ancestor::Tour/NumTour"/>
          <Mention><xsl:value-of select="name()"/></Mention>
          <xsl:copy-of select="*"/>
        </Mention>
      </xsl:template>
      
    </xsl:stylesheet>
    

    Python (read directly from URL)

    url = (
        "https://www.resultats-elections.interieur.gouv.fr/telechargements/" 
        "PR2022/resultatsT1/027/058/058com.xml"
    )
    
    mentions_df = pd.read_xml(url, stylesheet=mentions_xsl)
    

    Output

                    Type  Annee  CodReg  CodReg3Car                   LibReg  CodDpt  CodMinDpt  CodDpt3Car  LibDpt  CodSubCom    LibSubCom  NumTour      Mention  Nombre RapportInscrit RapportVotant
    0     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1     Inscrits     105           None          None
    1     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1  Abstentions      24          22,86          None
    2     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1      Votants      81          77,14          None
    3     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1       Blancs       2           1,90          2,47
    4     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1         Nuls       0           0,00          0,00
                 ...    ...     ...         ...                      ...     ...        ...         ...     ...        ...          ...      ...          ...     ...            ...           ...
    1849  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1  Abstentions      13          14,94          None
    1850  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1      Votants      74          85,06          None
    1851  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1       Blancs       1           1,15          1,35
    1852  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1         Nuls       0           0,00          0,00
    1853  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1     Exprimes      73          83,91         98,65
    
    [1854 rows x 16 columns]
    

    Candidats (save as .xsl, a special .xml file or embed as string in Python)

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output indent="yes"/>
      <xsl:strip-space elements="*"/>
      
      <xsl:template match="/">
        <Candidats>
          <xsl:apply-templates select="descendant::Tour/Resultats/Candidats"/>
        </Candidats>
      </xsl:template>
      
      <xsl:template match="Candidat">
        <xsl:copy>
          <xsl:copy-of select="ancestor::Election/Scrutin/*"/>
          <xsl:copy-of select="ancestor::Departement/*[name()!='Communes']"/>
          <xsl:copy-of select="ancestor::Commune/*[name()!='Tours']"/>
          <xsl:copy-of select="ancestor::Tour/NumTour"/>
          <xsl:copy-of select="*"/>
        </xsl:copy>
      </xsl:template>
      
    </xsl:stylesheet>
    

    Python (read directly from URL)

    url = (
        "https://www.resultats-elections.interieur.gouv.fr/telechargements/" 
        "PR2022/resultatsT1/027/058/058com.xml"
    )
    
    candidats_df = pd.read_xml(url, stylesheet=candidats_xsl)
    

    Output

                    Type  Annee  CodReg  CodReg3Car                   LibReg  CodDpt  CodMinDpt  CodDpt3Car  LibDpt  CodSubCom    LibSubCom  NumTour  NumPanneauCand         NomPsn PrenomPsn CivilitePsn  NbVoix RapportExprime RapportInscrit
    0     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               1        ARTHAUD  Nathalie         Mme       0           0,00           0,00
    1     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               2        ROUSSEL    Fabien          M.       3           3,80           2,86
    2     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               3         MACRON  Emmanuel          M.      14          17,72          13,33
    3     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               4       LASSALLE      Jean          M.       2           2,53           1,90
    4     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               5         LE PEN    Marine         Mme      28          35,44          26,67
                 ...    ...     ...         ...                      ...     ...        ...         ...     ...        ...          ...      ...             ...            ...       ...         ...     ...            ...            ...
    3703  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1               8        HIDALGO      Anne         Mme       0           0,00           0,00
    3704  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1               9          JADOT   Yannick          M.       4           5,48           4,60
    3705  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1              10       PÉCRESSE   Valérie         Mme       6           8,22           6,90
    3706  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1              11         POUTOU  Philippe          M.       1           1,37           1,15
    3707  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1              12  DUPONT-AIGNAN   Nicolas          M.       4           5,48           4,60
    
    [3708 rows x 19 columns]
    

    You can join resulting data frames using their shared Communes nodes: <CodSubCom> and <LibSubCom> but may have to pivot_table on the aggregate data for a one-to-many merge. Below demonstrates with Nombre aggregate:

    mentions_candidats_df = (
        candidats_df.merge(
            mentions_df.pivot_table(
                index=["CodSubCom", "LibSubCom"],
                columns="Mention",
                values="Nombre",
                aggfunc="max"
            ).reset_index(),
            on=["CodSubCom", "LibSubCom"]
        )
    )
    
    mentions_candidats_df.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 3708 entries, 0 to 3707
    Data columns (total 25 columns):
     #   Column          Non-Null Count  Dtype 
    ---  ------          --------------  ----- 
     0   Type            3708 non-null   object
     1   Annee           3708 non-null   int64 
     2   CodReg          3708 non-null   int64 
     3   CodReg3Car      3708 non-null   int64 
     4   LibReg          3708 non-null   object
     5   CodDpt          3708 non-null   int64 
     6   CodMinDpt       3708 non-null   int64 
     7   CodDpt3Car      3708 non-null   int64 
     8   LibDpt          3708 non-null   object
     9   CodSubCom       3708 non-null   int64 
     10  LibSubCom       3708 non-null   object
     11  NumTour         3708 non-null   int64 
     12  NumPanneauCand  3708 non-null   int64 
     13  NomPsn          3708 non-null   object
     14  PrenomPsn       3708 non-null   object
     15  CivilitePsn     3708 non-null   object
     16  NbVoix          3708 non-null   int64 
     17  RapportExprime  3708 non-null   object
     18  RapportInscrit  3708 non-null   object
     19  Abstentions     3708 non-null   int64 
     20  Blancs          3708 non-null   int64 
     21  Exprimes        3708 non-null   int64 
     22  Inscrits        3708 non-null   int64 
     23  Nuls            3708 non-null   int64 
     24  Votants         3708 non-null   int64 
    dtypes: int64(16), object(9)
    memory usage: 753.2+ KB
    

    In forthcoming pandas 1.5, read_xml will support dtypes to allow conversion after XSLT transformation in this case.