Search code examples
rxmldataframetidyverseunnest

How to unnest a dictionary from XML in R?


I am attempting to convert this xml to a dataframe in R:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml

library(xml2)
library(tidyverse)

fileurl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
xmllist <- as_list(read_xml(fileurl)) 
xml_df = tibble::as_tibble(xmllist) %>%
        unnest_longer(response)

row_wider = xml_df %>%
        unnest_wider(response)

row_df = row_wider %>%
        unnest(cols = names(.)) %>%
        unnest(cols = names(.)) %>%
        readr::type_convert()

The issue is that the 'location_1' column is a dictionary and shows up as NA when I unnest. How can I get each of the values of this dictionary into this column? Any help is much appreciated, thanks.


Solution

  • The location_1 column is an empty list (hence you get NAs) with two attributes human_address which is a JSON string and a logical needs_recoding. One option to get your desired result would be to first extract the content of these attributes and store them in a list. Afterwards you could use two unnest_wider to unnest the list column.

    library(xml2)
    library(tidyverse)
    
    parse_location_1 <- function(x) {
      x$location_1 <- list(
        human_address = jsonlite::fromJSON(attr(x$location_1, "human_address")),
        needs_recoding = attr(x$location_1, "needs_recoding")
      )
      x
    }
    
    fileurl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
    xmllist <- as_list(read_xml(fileurl))
    xml_df <- tibble::as_tibble(xmllist) %>%
      unnest_longer(response) |>
      mutate(response = map(
        response, parse_location_1
      ))
    
    row_wider <- xml_df %>%
      unnest_wider(response) |>
      unnest_wider(location_1) |> 
      unnest_wider(human_address)
    
    row_df <- row_wider %>%
      unnest(cols = where(is.list)) %>%
      unnest(cols = where(is.list)) %>%
      readr::type_convert()
    #> 
    #> ── Column specification ────────────────────────────────────────────────────────
    #> cols(
    #>   name = col_character(),
    #>   zipcode = col_double(),
    #>   neighborhood = col_character(),
    #>   councildistrict = col_double(),
    #>   policedistrict = col_character(),
    #>   address = col_character(),
    #>   city = col_character(),
    #>   state = col_character(),
    #>   zip = col_logical(),
    #>   needs_recoding = col_logical(),
    #>   response_id = col_character()
    #> )
    
    head(row_df)
    #> # A tibble: 6 × 11
    #>   name   zipcode neighborhood councildistrict policedistrict address city  state
    #>   <chr>    <dbl> <chr>                  <dbl> <chr>          <chr>   <chr> <chr>
    #> 1 410      21206 Frankford                  2 NORTHEASTERN   4509 B… Balt… MD   
    #> 2 1919     21231 Fells Point                1 SOUTHEASTERN   1919 F… Balt… MD   
    #> 3 SAUTE    21224 Canton                     1 SOUTHEASTERN   2844 H… Balt… MD   
    #> 4 #1 CH…   21211 Hampden                   14 NORTHERN       3998 R… Balt… MD   
    #> 5 #1 ch…   21223 Millhill                   9 SOUTHWESTERN   2481 f… Balt… MD   
    #> 6 19TH …   21218 Clifton Park              14 NORTHEASTERN   2722 H… Balt… MD   
    #> # ℹ 3 more variables: zip <lgl>, needs_recoding <lgl>, response_id <chr>