Search code examples
rxmldataframerenameanalysis

Extend Value in Dataframe by add Value of an other Cell and Row in R


I try to analyze XML-Data in R with dplyr and ggplot2. My code is able to transform the XML data into a data frame. Unfortunately the structure gets lost.

My XML-document have following structure by example:

<?xml version="1.0" encoding="UTF-8"?>
<Budget price="1234" items="1234" year="1990">
<Account name="a" value="123" step="0">
<Account name="1" value="12" step="1"/>
<Account name="1.1" value="12" step="2"/>
<Account name="2" value="12" step="1"/>
<Account name="2.1" value="9" step="2"/>
<Account name="2.2" value="3" step="2"/>
<Account name="3" value="99" step="1"/>
<Account name="3.1" value="78" step="2"/>
<Account name="3.1.1" value="70" step="3"/>
<Account name="3.1.2" value="8" step="3"/>
<Account name="3.2" value="21" step)="2"/>
</Account>
<Account name="b" value="234" step="0">
<Account name="1" value="200" step="1"/>

and so on

At first I save all values:

budget_values = xml_find_all(doc,"//Budget",flatten=FALSE)

Afterwards I select some of the values:

step_ids = purrr::map_chr(budget_values, ~xml_attr(.,"step"))
name_values = purrr::map_chr(budget_values, ~xml_attr(.,"name"))
values = purrr::map_chr(budget_values, ~xml_attr(.,"value"))

Save attributes in a combined list:

values_list <- list((step_ids),(name_values),(values))

And convert it into a data frame:

budget_df <- data.frame(sapply(values_list, c))

That works great. I got an DF like this:

Step-ID name vlaue
0 a 1234
1 1 12
2 1.1 12
1 2 12
2 2.1 9
2 2.2 3
1 3 99
0 b 234
1 1 200

and so on

As you see from the example some names are repeated - usually step 1 and 2; step 3 is usually very unique.

My aim is following dataframe to analyze the data more structured.

Step-ID name vlaue
0 a 1234
1 a1 12
2 a1.1 12
1 a2 12
2 a2.1 9
2 a2.2 3
1 a3 99
0 b 234
1 b1 200

and so on

For example: I want the values of all step1. Now I can't tell from which budget it is. With the new name I can see: this value is from budget a, this one from budget b and so on.

I tried following for-loop and stored the result in a new dataframe

df<-for (rows in budget_df) {
  if (rows$`Step-ID` == "0") {
    saved_name <- rows$name
    print(saved_name)
  }
  else
    (rows$`Step-ID` == "1"){
      rows$Haushalt+saved_name
      saved_names<-saved_name+rows$name
      print(saved_names)
    }
  else(rows$`Step-ID`=="2"){
    rows$Haushalt+saved_name
  }
  else(rows$`Step-ID`=="3"){
    rows$name+saved_names
  }
}
View(df)

And I get following Error:

Error: unexpected '{' in:
"  else
    (rows$`Step-ID` == "1"){"

My questions is: Is there a better way to analyze the data or rename the values in name?

Thank you very much for your help!

Update:

Thanks again to @jpsmith. I tried following code regarding to his recommondation:

df-budget_df

    budget <- ""
    df <- for (row in df) {
      mutate(
        case_when (
          df$`Step-ID` == "0" ~ budget <- df$Haushalt,
          df$`Step-ID` == "2" ~ mutate(df, sturucture = paste(budget, df$Haushalt)),
          df$`Step-ID` == "2" ~ budget <- c(budget, df$Haushalt),
          df$`Step-ID` == "3" ~ mutate(df, sturucture = paste(values, df$Haushalt))
        )
      )
    }

Explains logically, what I want to do, but doesn't work. I think, it's because of trying to store the value with <-? I couldn't find another way at ?case_when to store values. Another code (I have overwritten) stores the value of Step-ID and extended the value of Haushalt of the same step, instead of: Step-ID 0 to Haushalt with Step-ID 1 under Step-ID 0 and Step-ID 1 to Haushalt with Step-ID 2.


Solution

  • Consider XSLT, the special-purpose language designed to transform XML files, in order to prefix the parent @name attribute to underlying child @name attributes. With R's xslt (complementary package to xml2) you can run XSLT 1.0 scripts:

    XSLT (save as .xsl file, a special .xml file)

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
        <xsl:strip-space elements="*"/>
    
        <xsl:template match="/Budget">
         <xsl:copy>
           <xsl:apply-templates select="Account"/>
         </xsl:copy>
        </xsl:template>
        
        <!-- MOVE ATTRIBUTES TO ELEMENTS -->
        <xsl:template match="Account">
         <xsl:copy>
           <stepid><xsl:value-of select="@step"/></stepid>
           <name><xsl:value-of select="@name"/></name>
           <value><xsl:value-of select="@value"/></value>
         </xsl:copy>
         <xsl:apply-templates select="*"/>
        </xsl:template>
        
        <!-- MOVE ATTRIBUTES TO ELEMENTS AND CONCATENATE PARENT @name ATTRIBUTE -->
        <xsl:template match="Account/*">
         <xsl:variable name="step">
           <xsl:value-of select="../@name"/>
         </xsl:variable>
         <xsl:copy>
           <stepid><xsl:value-of select="@step"/></stepid>
           <name><xsl:value-of select="concat($step, @name)"/></name>
           <value><xsl:value-of select="@value"/></value>
         </xsl:copy>
        </xsl:template>
    </xsl:stylesheet>
    

    Online Demo

    R

    library(xml2)
    library(xslt)
    
    # LOAD XML AND XSLT
    doc <- read_xml("inputF.xml")
    style <- read_xml("style.xsl", package = "xslt")
    
    # RUN TRANSFORMATION AND SEE OUTPUT
    new_xml <- xml_xslt(doc, style)
    
    # RETRIEVE ALL NODES
    recs <- xml2::xml_find_all(new_xml, "//Account")
    
    # BIND EACH CHILD TEXT AND NAME
    df_list <- lapply(recs, function(r) {
      vals <- xml2::xml_children(r)
      
      df <- setNames(
        c(xml2::xml_text(vals)), 
        c(xml2::xml_name(vals))
      ) |> rbind() |> data.frame()
    })
    
    # COMBINE ALL DFS
    accounts_df <- do.call(rbind.data.frame, df_list)
    

    Output

    accounts_df
    
    #    stepid   name value
    # 1       0      a   123
    # 2       1     a1    12
    # 3       2   a1.1    12
    # 4       1     a2    12
    # 5       2   a2.1     9
    # 6       2   a2.2     3
    # 7       1     a3    99
    # 8       2   a3.1    78
    # 9       3 a3.1.1    70
    # 10      3 a3.1.2     8
    # 11      2   a3.2    21
    # 12      0      b   234
    # 13      1     b1   200