R: Reading and joining XML data instances (ODK)


I am using OpenDataKit's ODK Collect to collect survey data in the field. Currently I am using ODK Aggregate to accept data submissions on Google Cloud before downloading them as CSV files. This entire process is somewhat frustrating because every step is prone to errors. I would like to instead be able to read the data from the tablets directly into R and compile data frames for each level of the data.

The data is saved as individual instances in XML format. Right now we have something like 2000 different instances. When reading an individual instance into R with the XML package, the data looks like this:

  <A_note/>
  <A_group1>
    <A_note1/>
    <A_note2/>
    <A01>2</A01>
  </A_group1>
  <A_group1.5>
    <A02>901</A02>
    <A02a/>
  </A_group1.5>
  <A_group2>
    <A03>9</A03>
    <A03a/>
    <HH_key>9010</HH_key>
    <A04a/>
    <A06/>
    <A07/>
  </A_group2>
  <A_group3>
    <A04>9</A04>
    <A04a_note/>
    <A06_note/>
    <A07_note/>
    <A04a_int>840256790</A04a_int>
    <A05>2</A05>
    <A06a>Baixo Umbeluze, perto do rio Umbeluze.</A06a>
    <A07a>-26.057376459502194 32.33107993182396 15.271170877998825 4.0</A07a>

We can see that there are a lot of tags which don't hold any information (for example A_note1 and A_note2), as well as groups which are unnecessary because the tag names inside them are already unique (A_group1 and A_group2).

What I would like to be able to do is:
1. flatten the data by removing unnecessary groups;
2. treat each instance as a separate row of data and stack the information from all instances together.

I know this is probably too much to ask in a single post, but I wanted to put this out there in case someone has already put in the hard work of figuring out how to make this work.

Thanks, Francis


Solution

  • I know this is 4 years late...

    The R package ruODK tackles exactly that problem. XML's complexity of names, namespaces, and attributes translates to a nested list in R.
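
    To illustrate the nested-list point, here is a minimal sketch using the xml2 package (not ruODK itself). The instance is a cut-down version of the sample in the question, and the `<data>` root element is an assumption, since the sample does not show its root.

    ```r
    library(xml2)

    # Cut-down instance modelled on the question's sample.
    # The <data> root tag is assumed for illustration only.
    instance <- read_xml(paste0(
      "<data>",
      "<A_note/>",
      "<A_group1><A_note1/><A01>2</A01></A_group1>",
      "<A_group2><A03>9</A03><HH_key>9010</HH_key></A_group2>",
      "</data>"
    ))

    # as_list() converts the document into a nested list:
    # groups become lists of lists, and empty tags become empty lists.
    nested <- as_list(instance)

    str(nested$data$A_group1)

    # Values sit one level deeper than the tag name, as text nodes.
    hh_key <- nested$data$A_group2$HH_key[[1]]
    ```

    Each group adds one level of nesting, which is why the groups have to be removed before the instances can be stacked as rows.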

    Judging by the age of your question, you must have been using ODK Aggregate, which is being replaced by ODK Central. ODK Central implements Aggregate's OpenRosa API, plus a RESTful API, plus OData API endpoints. Side note: fantastic interactive API docs are here - JavaRosa endpoints should also work for ODK Aggregate.

    To figure out how to un-nest your XML / nested list in R, you could:

    Note that the tidyr functions used by ruODK were implemented about four years after your question, and ruODK is built on top of them.
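
    As a rough sketch of the flatten-and-stack idea from the question, the following uses xml2 to keep only the leaf tags that hold a value and `tidyr::unnest_wider()` to turn each instance into one row. This is not ruODK's own implementation; the two instances and the helper `flatten_instance()` are made up for illustration, and the sketch assumes tag names are unique within an instance (no repeat groups).

    ```r
    library(xml2)
    library(tibble)
    library(tidyr)

    # Two hypothetical instances standing in for the XML files on the tablets.
    instances <- c(
      "<data><A_group1><A_note1/><A01>2</A01></A_group1><A_group2><HH_key>9010</HH_key></A_group2></data>",
      "<data><A_group1><A_note1/><A01>1</A01></A_group1><A_group2><HH_key>9011</HH_key></A_group2></data>"
    )

    # Hypothetical helper: keep leaf elements that contain text, which drops
    # both the group wrappers and the empty note tags, and return a named list.
    flatten_instance <- function(x) {
      doc <- read_xml(x)
      leaves <- xml_find_all(doc, "//*[not(*)][normalize-space()]")
      as.list(stats::setNames(xml_text(leaves), xml_name(leaves)))
    }

    # One list-column entry per instance, widened into one row per instance.
    stacked <- unnest_wider(
      tibble(instance = lapply(instances, flatten_instance)),
      instance
    )

    print(stacked)
    ```

    Since `read_xml()` also accepts file paths, the same helper could in principle be applied over the result of `list.files()` on the folder of instances. All columns come back as character and would still need type conversion, and repeat groups would need a separate table per level.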

    Hope this helps!

    Edit: hat tip to @muntashir-al-arefin, who authored the R package "odk". His package is compared to other similar packages in the ruODK README.