Search code examples
rxmltidyversexml2

Parse xml to dataframe including children and attributes in R


I am trying to create a dataframe from attached xml https://1drv.ms/u/s!Am7buNMZi-gwgeBmbk6A-NRIRarjYw?e=Pcgm7c

I need to get for all players their column info and info about Team (Parent)

XML sample

<SoccerFeed timestamp="20190519T183022+0000">
  <SoccerDocument Type="SQUADS Latest" competition_code="ES_PL" competition_id="23" competition_name="Spanish La Liga" season_id="2018" season_name="Season 2018/2019">
    <Team country="Spain" country_id="4" country_iso="ES" official_club_name="Deportivo Alavés S.A.D." region_id="17" region_name="Europe" short_club_name="Alavés" uID="t173">
      <Founded>1921</Founded>
      <Name>Alavés</Name>
      <Player uID="p91406">
        <Name>Fernando Pacheco</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="first_name">Fernando</Stat>
        <Stat Type="last_name">Pacheco</Stat>
        <Stat Type="birth_date">1992-05-18</Stat>
        <Stat Type="birth_place">Badajoz</Stat>
        <Stat Type="first_nationality">Spain</Stat>
        <Stat Type="preferred_foot">Left</Stat>
        <Stat Type="weight">81</Stat>
        <Stat Type="height">186</Stat>
        <Stat Type="jersey_num">1</Stat>
        <Stat Type="real_position">Goalkeeper</Stat>
        <Stat Type="real_position_side">Unknown</Stat>
        <Stat Type="join_date">2015-08-07</Stat>
        <Stat Type="country">Spain</Stat>
      </Player>
      <Player uID="p176245">
        <Name>Antonio Sivera</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="first_name">Antonio</Stat>
        <Stat Type="last_name">Sivera</Stat>
        <Stat Type="birth_date">1996-08-11</Stat>
        <Stat Type="birth_place">Jávea</Stat>
        <Stat Type="first_nationality">Spain</Stat>
        <Stat Type="preferred_foot">Right</Stat>
        <Stat Type="weight">75</Stat>
        <Stat Type="height">184</Stat>
        <Stat Type="jersey_num">13</Stat>
        <Stat Type="real_position">Goalkeeper</Stat>
        <Stat Type="real_position_side">Unknown</Stat>
        <Stat Type="join_date">2017-07-19</Stat>
        <Stat Type="country">Spain</Stat>
      </Player>
     </Team>
     <Team city="Madrid" country="Spain" country_id="4" country_iso="ES" official_club_name="Club Atlético de Madrid S.A.D" postal_code="28005" region_id="17" region_name="Europe" short_club_name="Atlético" street="Paseo Virgen del Puerto, 67" uID="t175" web_address="www.clubatleticodemadrid.com/">
      <Founded>1903</Founded>
      <Name>Atlético de Madrid</Name>
      <Player uID="p59981">
        <Name>Antonio Adán</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="first_name">Antonio</Stat>
        <Stat Type="last_name">Adán</Stat>
        <Stat Type="birth_date">1987-05-13</Stat>
        <Stat Type="birth_place">Madrid</Stat>
        <Stat Type="first_nationality">Spain</Stat>
        <Stat Type="preferred_foot">Left</Stat>
        <Stat Type="weight">92</Stat>
        <Stat Type="height">190</Stat>
        <Stat Type="jersey_num">1</Stat>
        <Stat Type="real_position">Goalkeeper</Stat>
        <Stat Type="real_position_side">Unknown</Stat>
        <Stat Type="join_date">2018-07-10</Stat>
        <Stat Type="country">Spain</Stat>
      </Player>
      <Player uID="p81352">
        <Name>Jan Oblak</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="first_name">Jan</Stat>
        <Stat Type="last_name">Oblak</Stat>
        <Stat Type="birth_date">1993-01-07</Stat>
        <Stat Type="birth_place">Skojfa Loka</Stat>
        <Stat Type="first_nationality">Slovenia</Stat>
        <Stat Type="preferred_foot">Right</Stat>
        <Stat Type="weight">87</Stat>
        <Stat Type="height">188</Stat>
        <Stat Type="jersey_num">13</Stat>
        <Stat Type="real_position">Goalkeeper</Stat>
        <Stat Type="real_position_side">Unknown</Stat>
        <Stat Type="join_date">2014-07-16</Stat>
        <Stat Type="country">Slovenia</Stat>
      </Player>
     </Team>
   </SoccerDocument>
</SoccerFeed>



My desired columns

TEAM COLUMNS

  • country (SoccerFeed/SoccerDocument/Team attribute)
  • country_id (SoccerFeed/SoccerDocument/Team attribute)
  • country_iso (SoccerFeed/SoccerDocument/Team attribute)
  • official_club_name (SoccerFeed/SoccerDocument/Team attribute)
  • region_id (SoccerFeed/SoccerDocument/Team attribute)
  • region_name (SoccerFeed/SoccerDocument/Team attribute)
  • short_club_name (SoccerFeed/SoccerDocument/Team attribute)
  • team_uID (SoccerFeed/SoccerDocument/Team attribute uID)
  • team_name (SoccerFeed/SoccerDocument/Team/Name)
  • team_founded (SoccerFeed/SoccerDocument/Team/Founded)

PLAYER columns

  • player_uID (/SoccerFeed/SoccerDocument/Team/Player)

  • player_name (/SoccerFeed/SoccerDocument/Team/Player/Name)

  • player_position (/SoccerFeed/SoccerDocument/Team/Player/Position)

  • player_first_name (/SoccerFeed/SoccerDocument/Team/Player/Stat type = first name)
  • player_last_name (/SoccerFeed/SoccerDocument/Team/Player/Stat type = last name)
  • player_first_name (/SoccerFeed/SoccerDocument/Team/Player/Stat type = first name)
  • player_birth_place (/SoccerFeed/SoccerDocument/Team/Player/Stat type = birth place)
  • player_preferred_foot (/SoccerFeed/SoccerDocument/Team/Player/Stat type = preferred_foot) ... other players stats (weight, height, jersey_num,...country)

I am not interested in Teams and Player nodes under Player changes below section /SoccerFeed/SoccerDocument/PlayerChanges

I started with tidyverse and xml2 to gather player info in combination with tidyverse but i was unable to get team parent info and different stats for players


library(xml2)
library(tidyverse)
library(plyr)



x <- read_xml("squads.xml")

players <- x %>% 
  xml_find_all('/SoccerFeed/SoccerDocument/Team/Player') %>% 
  map_df(~flatten(c(xml_attrs(.x), 
                    map(xml_children(.x), 
                        ~set_names(as.list(xml_text(.x)), xml_name(.x)))))) %>%
  type_convert()


Solution

  • Since you use xml2 and require various data nodes that differ across nested levels, consider XSLT, the special-purpose language (like SQL) designed to transform XML files. In R, the xslt package, sister module to xml2, can run XSLT 1.0 scripts. The recursive, template nature of XSLT helps avoid complex nested loops or mapping at application layer, here being R. Plus XSLT is portable (like SQL) and can be run outside of R.

    While this may be a whole new concept out of left field requiring a learning curve, it cleanly flattens your XML to the 2-D structure needed for data sets. You also separate XML handling (XSLT) from data handling (R). Specifically, only Player level is retained with respective Team data migrated down (see demo).

    XSLT (save as .xsl, a special .xml file)

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output indent="yes"/>
      <xsl:strip-space elements="*"/>
    
      <xsl:template match="/SoccerFeed|SoccerDocument">
          <xsl:apply-templates select="*"/>
      </xsl:template>
    
      <xsl:template match="Team">
          <xsl:apply-templates select="Player"/>
      </xsl:template>
    
      <xsl:template match="Team/@*">
        <xsl:element name="{concat('team_', name(.))}">
          <xsl:value-of select="."/>      
        </xsl:element>
      </xsl:template>
    
      <xsl:template match="Player">
        <xsl:copy>
          <xsl:apply-templates select="ancestor::Team/@*"/>
          <xsl:copy-of select="Name|Position"/>
          <xsl:apply-templates select="@*|Stat"/>
        </xsl:copy>
      </xsl:template>
    
      <xsl:template match="Player/@*">
        <xsl:element name="{name(.)}">
          <xsl:value-of select="."/>      
        </xsl:element>
      </xsl:template>
    
      <xsl:template match="Stat">
        <xsl:element name="{@Type}">
          <xsl:value-of select="text()"/>     
        </xsl:element>
      </xsl:template>
    </xsl:stylesheet>
    

    Online Demo

    R (results in data frame of all character types)

    library(xml2)
    library(xslt)
    library(dplyr)
    
    # INPUT SOURCE
    doc <- read_xml("/path/to/Input.xml")
    style <- read_xml("/path/to/Style.xsl", package = "xslt")
    
    # TRANSFORM 
    new_xml <- xml_xslt(doc, style)
    
    # RETRIEVE Player NODES
    recs <- xml_find_all(new_xml, "//Player")
    
    # BIND EACH CHILD TEXT AND NAME TO Player DFs
    df_list <- lapply(recs, function(r) 
        data.frame(rbind(setNames(xml_text(xml_children(r)), 
                                  xml_name(xml_children(r)))),
                   stringsAsFactors = FALSE)
    )
    
    # BIND ALL DFs TO SINGLE MASTER DF
    final_df <- bind_rows(df_list)