Search code examples
xmlrxml-parsinginfopathinfopath-forms-services

Import Infopath .XML forms into data frame in R


What's the best way to import Infopath .XML forms in R, and transform into a dataframe? If I open the Infopath .XML file in Excel, the rows and columns of the data frame appear correctly.

Here's what I tried in R using the XML package:

  1. Iused xmlParse() to parse the XML file
  2. I used xmlToDataFrame() to attempt to transform the parsed XML file to a data frame

In step 2, however, I receive the following error:

Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("touch your head13011000",  : 
  duplicate subscripts for columns

There do not appear to be duplicate columns, though, when I open the XML file in Excel. How can I transform this XML file from Infopath into a data frame in R? The expected columns should be (as they appear in Excel):

TCID, DateCoded, tcAge, T1_B3, T1_B2, T1_B1, T1_B0, T1_A3, T1_A2, T1_A1, T1_A0, T1_DelayTotal, T2_A3, T2_A2, T2_A1, T2_A, T2_B3, T2_B2, T2_B1, T2_B0, T2_DelayTotal, Coder, notes_t1, note_t2, bachildpres30, baparpres30, bapassptgo, bapassptnogo, bamissgame, P1_B3, P1_B2, P1_B1, P1_B0, P1_A3, P1_A2, P1_A1, P1_A0, P1_DelayTotal, P1_action, P1_go-nogo, P1_score, P1_delay, P1_trial, P1_Ecommand, P1_imitation, P1_restraint, P1_ruleswitch, P1_trials, P1_gotrials, P1_nogotrials, T1_gotrials, T1_nogotrials, T1_trials, T2_gotrials, T2_nogotrials, T2_trials, P1_notplay, T1_trial, T1_go-nogo, T1_score, T1_delay, T1_action, T2_trial, T2_go-nogo, T2_score, T2_delay, T2_action

For the variables that appear multiple times in the XML file, I'd like them to be in the data frame in long form (i.e., multiple rows for the same variable). I don't have much experience with XML files, so your help is greatly appreciated.

Below is what the parsed XML file looks like in R when I use xmlParse:

<my:myFields lang="en-us" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint" xmlns:aml="http://schemas.microsoft.com/aml/2001/core" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:st="urn:schemas-microsoft-com:office:smarttags" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:my="http://schemas.microsoft.com/office/infopath/2003/myXSD/2009-07-01T18:12:59" xmlns:xd="http://schemas.microsoft.com/office/infopath/2003">
 <my:SPSS>
  <my:TCID>10</my:TCID>
  <my:DateCoded>2013-04-01</my:DateCoded>
  <my:tcAge>30</my:tcAge>
  <my:T1_B3>6</my:T1_B3>
  <my:T1_B2>0</my:T1_B2>
  <my:T1_B1>0</my:T1_B1>
  <my:T1_B0>0</my:T1_B0>
  <my:T1_A3>0</my:T1_A3>
  <my:T1_A2>0</my:T1_A2>
  <my:T1_A1>1</my:T1_A1>
  <my:T1_A0>5</my:T1_A0>
  <my:T1_DelayTotal>1</my:T1_DelayTotal>
  <my:T2_A3 nil="true"/>
  <my:T2_A2 nil="true"/>
  <my:T2_A1 nil="true"/>
  <my:T2_A0 nil="true"/>
  <my:T2_B3 nil="true"/>
  <my:T2_B2 nil="true"/>
  <my:T2_B1 nil="true"/>
  <my:T2_B0 nil="true"/>
  <my:T2_DelayTotal nil="true"/>
  <my:Coder>Name</my:Coder>
 </my:SPSS>
 <my:notes_t1/>
 <my:note_t2/>
 <my:bachildpres30>0</my:bachildpres30>
 <my:baparpres30>0</my:baparpres30>
 <my:bapassptgo>1</my:bapassptgo>
 <my:bapassptnogo>0</my:bapassptnogo>
 <my:bamissgame>0</my:bamissgame>
 <my:P1_B3>4</my:P1_B3>
 <my:P1_B2>0</my:P1_B2>
 <my:P1_B1>0</my:P1_B1>
 <my:P1_B0>1</my:P1_B0>
 <my:P1_A3>0</my:P1_A3>
 <my:P1_A2>0</my:P1_A2>
 <my:P1_A1>1</my:P1_A1>
 <my:P1_A0>3</my:P1_A0>
 <my:P1_DelayTotal>0</my:P1_DelayTotal>
 <my:group2>
  <my:group3>
   <my:P1_action>touch your head</my:P1_action>
   <my:P1_go-nogo>1</my:P1_go-nogo>
   <my:P1_score>3</my:P1_score>
   <my:P1_delay>0</my:P1_delay>
   <my:P1_trial>1</my:P1_trial>
   <my:P1_Ecommand>1</my:P1_Ecommand>
   <my:P1_imitation>0</my:P1_imitation>
   <my:P1_restraint>0</my:P1_restraint>
   <my:P1_ruleswitch>0</my:P1_ruleswitch>
  </my:group3>
  <my:group3>
   <my:P1_action>touch your nose</my:P1_action>
   <my:P1_go-nogo>1</my:P1_go-nogo>
   <my:P1_score>3</my:P1_score>
   <my:P1_delay>0</my:P1_delay>
   <my:P1_trial>2</my:P1_trial>
   <my:P1_Ecommand>1</my:P1_Ecommand>
   <my:P1_imitation>0</my:P1_imitation>
   <my:P1_restraint>0</my:P1_restraint>
   <my:P1_ruleswitch>0</my:P1_ruleswitch>
  </my:group3>
  <my:group3>
   <my:P1_action>touch your tummy</my:P1_action>
   <my:P1_go-nogo>1</my:P1_go-nogo>
   <my:P1_score>3</my:P1_score>
   <my:P1_delay>0</my:P1_delay>
   <my:P1_trial>3</my:P1_trial>
   <my:P1_Ecommand>1</my:P1_Ecommand>
   <my:P1_imitation>0</my:P1_imitation>
   <my:P1_restraint>0</my:P1_restraint>
   <my:P1_ruleswitch>0</my:P1_ruleswitch>
  </my:group3>
  <my:group3>
   <my:P1_action>touch your head</my:P1_action>
   <my:P1_go-nogo>1</my:P1_go-nogo>
   <my:P1_score>0</my:P1_score>
   <my:P1_delay>0</my:P1_delay>
   <my:P1_trial>4</my:P1_trial>
   <my:P1_Ecommand>0</my:P1_Ecommand>
   <my:P1_imitation>0</my:P1_imitation>
   <my:P1_restraint>0</my:P1_restraint>
   <my:P1_ruleswitch>0</my:P1_ruleswitch>
  </my:group3>
  <my:group3>
   <my:P1_action>touch your head</my:P1_action>
   <my:P1_go-nogo>1</my:P1_go-nogo>
   <my:P1_score>3</my:P1_score>
   <my:P1_delay>0</my:P1_delay>
   <my:P1_trial>5</my:P1_trial>
   <my:P1_Ecommand>0</my:P1_Ecommand>
   <my:P1_imitation>0</my:P1_imitation>
   <my:P1_restraint>0</my:P1_restraint>
   <my:P1_ruleswitch>0</my:P1_ruleswitch>
  </my:group3>
  <my:group3>
   <my:P1_action>touch your nose</my:P1_action>
   <my:P1_go-nogo>1</my:P1_go-nogo>
   <my:P1_score>3</my:P1_score>
   <my:P1_delay>0</my:P1_delay>
   <my:P1_trial>6</my:P1_trial>
   <my:P1_Ecommand>0</my:P1_Ecommand>
   <my:P1_imitation>0</my:P1_imitation>
   <my:P1_restraint>0</my:P1_restraint>
   <my:P1_ruleswitch>0</my:P1_ruleswitch>
  </my:group3>
  <my:group3>
   <my:P1_action>clap your hands</my:P1_action>
   <my:P1_go-nogo>1</my:P1_go-nogo>
   <my:P1_score>3</my:P1_score>
   <my:P1_delay>0</my:P1_delay>
   <my:P1_trial>7</my:P1_trial>
   <my:P1_Ecommand>0</my:P1_Ecommand>
   <my:P1_imitation>0</my:P1_imitation>
   <my:P1_restraint>0</my:P1_restraint>
   <my:P1_ruleswitch>0</my:P1_ruleswitch>
  </my:group3>
  <my:group3>
   <my:P1_action>touch your nose</my:P1_action>
   <my:P1_go-nogo>0</my:P1_go-nogo>
   <my:P1_score>0</my:P1_score>
   <my:P1_delay>0</my:P1_delay>
   <my:P1_trial>8</my:P1_trial>
   <my:P1_Ecommand>0</my:P1_Ecommand>
   <my:P1_imitation>0</my:P1_imitation>
   <my:P1_restraint>0</my:P1_restraint>
   <my:P1_ruleswitch>0</my:P1_ruleswitch>
  </my:group3>
  <my:group3>
   <my:P1_action>touch your ears</my:P1_action>
   <my:P1_go-nogo>0</my:P1_go-nogo>
   <my:P1_score>0</my:P1_score>
   <my:P1_delay>0</my:P1_delay>
   <my:P1_trial>9</my:P1_trial>
   <my:P1_Ecommand>0</my:P1_Ecommand>
   <my:P1_imitation>0</my:P1_imitation>
   <my:P1_restraint>0</my:P1_restraint>
   <my:P1_ruleswitch>0</my:P1_ruleswitch>
  </my:group3>
  <my:group3>
   <my:P1_action>touch your tummy</my:P1_action>
   <my:P1_go-nogo>0</my:P1_go-nogo>
   <my:P1_score>0</my:P1_score>
   <my:P1_delay>0</my:P1_delay>
   <my:P1_trial>10</my:P1_trial>
   <my:P1_Ecommand>0</my:P1_Ecommand>
   <my:P1_imitation>0</my:P1_imitation>
   <my:P1_restraint>0</my:P1_restraint>
   <my:P1_ruleswitch>0</my:P1_ruleswitch>
  </my:group3>
  <my:group3>
   <my:P1_action>touch your eyes</my:P1_action>
   <my:P1_go-nogo>0</my:P1_go-nogo>
   <my:P1_score>1</my:P1_score>
   <my:P1_delay>0</my:P1_delay>
   <my:P1_trial>11</my:P1_trial>
   <my:P1_Ecommand>0</my:P1_Ecommand>
   <my:P1_imitation>0</my:P1_imitation>
   <my:P1_restraint>1</my:P1_restraint>
   <my:P1_ruleswitch>0</my:P1_ruleswitch>
  </my:group3>
  <my:group3>
   <my:P1_action>touch your eyes</my:P1_action>
   <my:P1_go-nogo>1</my:P1_go-nogo>
   <my:P1_score>3</my:P1_score>
   <my:P1_delay>0</my:P1_delay>
   <my:P1_trial>12</my:P1_trial>
   <my:P1_Ecommand>0</my:P1_Ecommand>
   <my:P1_imitation>0</my:P1_imitation>
   <my:P1_restraint>0</my:P1_restraint>
   <my:P1_ruleswitch>0</my:P1_ruleswitch>
  </my:group3>
 </my:group2>
 <my:P1_trials>9</my:P1_trials>
 <my:P1_gotrials>5</my:P1_gotrials>
 <my:P1_nogotrials>4</my:P1_nogotrials>
 <my:T1_gotrials>6</my:T1_gotrials>
 <my:T1_nogotrials>6</my:T1_nogotrials>
 <my:T1_trials>12</my:T1_trials>
 <my:T2_gotrials>0</my:T2_gotrials>
 <my:T2_nogotrials>0</my:T2_nogotrials>
 <my:T2_trials>0</my:T2_trials>
 <my:P1_notplay/>
 <my:group4>
  <my:group5>
   <my:T1_trial>1</my:T1_trial>
   <my:T1_go-nogo>1</my:T1_go-nogo>
   <my:T1_score>3</my:T1_score>
   <my:T1_delay>1</my:T1_delay>
   <my:T1_action>Touch your tongue</my:T1_action>
  </my:group5>
  <my:group5>
   <my:T1_trial>2</my:T1_trial>
   <my:T1_go-nogo>1</my:T1_go-nogo>
   <my:T1_score>3</my:T1_score>
   <my:T1_delay>0</my:T1_delay>
   <my:T1_action>Touch your teeth</my:T1_action>
  </my:group5>
  <my:group5>
   <my:T1_trial>3</my:T1_trial>
   <my:T1_go-nogo>0</my:T1_go-nogo>
   <my:T1_score>0</my:T1_score>
   <my:T1_delay>0</my:T1_delay>
   <my:T1_action>Touch your ear</my:T1_action>
  </my:group5>
  <my:group5>
   <my:T1_trial>4</my:T1_trial>
   <my:T1_go-nogo>1</my:T1_go-nogo>
   <my:T1_score>3</my:T1_score>
   <my:T1_delay>0</my:T1_delay>
   <my:T1_action>Clap your hands</my:T1_action>
  </my:group5>
  <my:group5>
   <my:T1_trial>5</my:T1_trial>
   <my:T1_go-nogo>0</my:T1_go-nogo>
   <my:T1_score>0</my:T1_score>
   <my:T1_delay>0</my:T1_delay>
   <my:T1_action>Clap your hands</my:T1_action>
  </my:group5>
  <my:group5>
   <my:T1_trial>6</my:T1_trial>
   <my:T1_go-nogo>0</my:T1_go-nogo>
   <my:T1_score>0</my:T1_score>
   <my:T1_delay>0</my:T1_delay>
   <my:T1_action>Touch your eyes</my:T1_action>
  </my:group5>
  <my:group5>
   <my:T1_trial>7</my:T1_trial>
   <my:T1_go-nogo>0</my:T1_go-nogo>
   <my:T1_score>0</my:T1_score>
   <my:T1_delay>0</my:T1_delay>
   <my:T1_action>Touch your feet</my:T1_action>
  </my:group5>
  <my:group5>
   <my:T1_trial>8</my:T1_trial>
   <my:T1_go-nogo>1</my:T1_go-nogo>
   <my:T1_score>3</my:T1_score>
   <my:T1_delay>0</my:T1_delay>
   <my:T1_action>Touch your nose</my:T1_action>
  </my:group5>
  <my:group5>
   <my:T1_trial>9</my:T1_trial>
   <my:T1_go-nogo>0</my:T1_go-nogo>
   <my:T1_score>1</my:T1_score>
   <my:T1_delay>0</my:T1_delay>
   <my:T1_action>Touch your nose</my:T1_action>
  </my:group5>
  <my:group5>
   <my:T1_trial>10</my:T1_trial>
   <my:T1_go-nogo>1</my:T1_go-nogo>
   <my:T1_score>3</my:T1_score>
   <my:T1_delay>0</my:T1_delay>
   <my:T1_action>Touch your tummy</my:T1_action>
  </my:group5>
  <my:group5>
   <my:T1_trial>11</my:T1_trial>
   <my:T1_go-nogo>0</my:T1_go-nogo>
   <my:T1_score>0</my:T1_score>
   <my:T1_delay>0</my:T1_delay>
   <my:T1_action>Wave your hand</my:T1_action>
  </my:group5>
  <my:group5>
   <my:T1_trial>12</my:T1_trial>
   <my:T1_go-nogo>1</my:T1_go-nogo>
   <my:T1_score>3</my:T1_score>
   <my:T1_delay>0</my:T1_delay>
   <my:T1_action>Touch your head</my:T1_action>
  </my:group5>
 </my:group4>
 <my:group6>
  <my:group7>
   <my:T2_trial>1</my:T2_trial>
   <my:T2_go-nogo>0</my:T2_go-nogo>
   <my:T2_score/>
   <my:T2_delay>0</my:T2_delay>
   <my:T2_action>Touch your tongue</my:T2_action>
  </my:group7>
  <my:group7>
   <my:T2_trial>2</my:T2_trial>
   <my:T2_go-nogo>0</my:T2_go-nogo>
   <my:T2_score/>
   <my:T2_delay>0</my:T2_delay>
   <my:T2_action>Touch your teeth</my:T2_action>
  </my:group7>
  <my:group7>
   <my:T2_trial>3</my:T2_trial>
   <my:T2_go-nogo>1</my:T2_go-nogo>
   <my:T2_score/>
   <my:T2_delay>0</my:T2_delay>
   <my:T2_action>Touch your ear</my:T2_action>
  </my:group7>
  <my:group7>
   <my:T2_trial>4</my:T2_trial>
   <my:T2_go-nogo>0</my:T2_go-nogo>
   <my:T2_score/>
   <my:T2_delay>0</my:T2_delay>
   <my:T2_action>Clap your hands</my:T2_action>
  </my:group7>
  <my:group7>
   <my:T2_trial>5</my:T2_trial>
   <my:T2_go-nogo>1</my:T2_go-nogo>
   <my:T2_score/>
   <my:T2_delay>0</my:T2_delay>
   <my:T2_action>Clap your hands</my:T2_action>
  </my:group7>
  <my:group7>
   <my:T2_trial>6</my:T2_trial>
   <my:T2_go-nogo>1</my:T2_go-nogo>
   <my:T2_score/>
   <my:T2_delay>0</my:T2_delay>
   <my:T2_action>Touch your eyes</my:T2_action>
  </my:group7>
  <my:group7>
   <my:T2_trial>7</my:T2_trial>
   <my:T2_go-nogo>1</my:T2_go-nogo>
   <my:T2_score/>
   <my:T2_delay>0</my:T2_delay>
   <my:T2_action>Touch your feet</my:T2_action>
  </my:group7>
  <my:group7>
   <my:T2_trial>8</my:T2_trial>
   <my:T2_go-nogo>0</my:T2_go-nogo>
   <my:T2_score/>
   <my:T2_delay>0</my:T2_delay>
   <my:T2_action>Touch your nose</my:T2_action>
  </my:group7>
  <my:group7>
   <my:T2_trial>9</my:T2_trial>
   <my:T2_go-nogo>1</my:T2_go-nogo>
   <my:T2_score/>
   <my:T2_delay>0</my:T2_delay>
   <my:T2_action>Touch your nose</my:T2_action>
  </my:group7>
  <my:group7>
   <my:T2_trial>10</my:T2_trial>
   <my:T2_go-nogo>0</my:T2_go-nogo>
   <my:T2_score/>
   <my:T2_delay>0</my:T2_delay>
   <my:T2_action>Touch your tummy</my:T2_action>
  </my:group7>
  <my:group7>
   <my:T2_trial>11</my:T2_trial>
   <my:T2_go-nogo>1</my:T2_go-nogo>
   <my:T2_score/>
   <my:T2_delay>0</my:T2_delay>
   <my:T2_action>Wave your hand</my:T2_action>
  </my:group7>
  <my:group7>
   <my:T2_trial>12</my:T2_trial>
   <my:T2_go-nogo>0</my:T2_go-nogo>
   <my:T2_score/>
   <my:T2_delay>0</my:T2_delay>
   <my:T2_action>Touch your head</my:T2_action>
  </my:group7>
 </my:group6>
</my:myFields>

Solution

  • In my experience, xmlToDataFrame only works when the XML is already structured in a very consistent way. The data you are working with is structured in many different ways:

    # Assuming you've already read your data into a character vector called `xml_file    
    xml_file <- xmlParse(xml_file)
    xml_file <- xmlToList(xml_file)
    
    stack(sapply(xml_file, length))
       values           ind
    1      22          SPSS
    2       0      notes_t1
    3       0       note_t2
    4       1 bachildpres30
    5       1   baparpres30
    6       1    bapassptgo
    7       1  bapassptnogo
    8       1    bamissgame
    9       1         P1_B3
    10      1         P1_B2
    11      1         P1_B1
    12      1         P1_B0
    13      1         P1_A3
    14      1         P1_A2
    15      1         P1_A1
    16      1         P1_A0
    17      1 P1_DelayTotal
    18     12        group2
    19      1     P1_trials
    20      1   P1_gotrials
    21      1 P1_nogotrials
    22      1   T1_gotrials
    23      1 T1_nogotrials
    24      1     T1_trials
    25      1   T2_gotrials
    26      1 T2_nogotrials
    27      1     T2_trials
    28      0    P1_notplay
    29     12        group4
    30     12        group6
    31      1        .attrs
    

    So most of your nodes contain a single value. A few are empty. The "SPSS" node contains 22 values, all with different names, and "group2", "group4", and "group6" all contain 12 nodes, each of which contains multiple values, but whose values are similar across nodes. When I looked at what Excel did when I imported the file, it stacked the 12-node components on top of each other, then strung all 22 of the "SPSS" components together with all of the single-value nodes and repeated that string for as many rows as had been created from stacking the 12-node components, and then bound the two pieces together in columns.

    To do that, separate out the long string from the 12-node chunks:

    xml_file_singles <- xml_file[sapply(xml_file, length) != 12]
    xml_file_singles[sapply(xml_file_singles, length) == 0] <- NA
    xml_file_singles <- unlist(xml_file_singles)
    
    xml_file_multiples <- xml_file[sapply(xml_file, length) == 12]
    

    Now take the 12-node chunks and turn each chunk into a data frame:

    xml_file_multiples <- lapply(1:length(xml_file_multiples), function(i) {
    
      x <- lapply(xml_file_multiples[[i]], function(y) {
        data.frame(as.list(unlist(y)), stringsAsFactors = FALSE)})
      x <- do.call("rbind", x)
      cbind("group" = names(xml_file_multiples)[i], x)
    })
    

    Now use the plyr package's rbind.fill function to put all the new data frames together:

    require(plyr)
    
    xml_file_multiples <- do.call("rbind.fill", xml_file_multiples)
    

    Now cbind your long string of values to your bound dataframes:

    xml_final <- cbind(as.list(xml_file_singles), xml_file_multiples, 
      stringsAsFactors = FALSE)
    

    This approach, like Excel's, introduces a whole lot of NAs, because the column names for your different 12-node chunks all differ slightly. If you did this before calling rbind.fill:

    xml_file_multiples <- lapply(1:length(xml_file_multiples), function(i) {
    
      x <- lapply(xml_file_multiples[[i]], function(y) {
        data.frame(as.list(unlist(y)), stringsAsFactors = FALSE)})
      x <- do.call("rbind", x)
      x <- cbind("group" = names(xml_file_multiples)[i], x)
      colnames(x) <- gsub("^\\w\\d_", "", colnames(x))
      x
    })
    

    You would generate fewer NA's because you would generate fewer redundant columns, but then you would have to rely on the values in the "group" column to keep track of which rows had originally appeared in which node.