Search code examples
rjsonapache-arrow

Is there a way to read a nested column?


I have a bunch of Newline-delimited JSON files that I want to read into R using the arrow package.

One of the parameters in the file is nested. The potential nested values are quite big and messy and I would prefer to only select the nested parameters I actually need.

Here is an example of the data I am working with:

# Bring in libraries
suppressMessages(library(arrow))
suppressMessages(library(data.table))

# Make data
tf <- tempfile()
on.exit(unlink(tf))
writeLines('
    { "hello": 3.5, "world": false, "yo":{"param1":"duck1","param2":"duck2"} }
    { "hello": 3.25, "world": null, "yo":{"param1":"duck3","param2":"duck4"} }
    { "hello": 0.0, "world": true, "yo":{"param1":"duck5","param2":"duck6"} }
  ', tf, useBytes = TRUE)
df <- read_json_arrow(tf)

This is the result of what I just read in:

enter image description here

read_json_arrow(tf, col_select = "yo")

I can also read in the "yo" column. The results is below:

enter image description here

But I am having trouble reading in the "yo.param1" data element:

enter image description here

Any ideas on how I might read this nested column in and avoid reading in the entire column?


Solution

  • When you read in an object using the read_* functions, you are reading them in as Arrow Tables, which are stored in-memory. Arrow is designed around doing zero-copy operations, which means that if you can manipulate Arrow objects directly instead of pulling them into R, this should help with not creating intermediate copies of objects and blowing up your R session, when working with larger objects.

    I have a potential solution that involves working with Arrow objects until the last moment when you pull the data into R, though it's not the most elegant.

    # Bring in libraries
    suppressMessages(library(arrow))
    
    # Make data
    tf <- tempfile()
    on.exit(unlink(tf))
    writeLines('
        { "hello": 3.5, "world": false, "yo":{"param1":"duck1","param2":"duck2"} }
        { "hello": 3.25, "world": null, "yo":{"param1":"duck3","param2":"duck4"} }
        { "hello": 0.0, "world": true, "yo":{"param1":"duck5","param2":"duck6"} }
      ', tf, useBytes = TRUE)
    
    # read in the JSON table as an Arrow Table
    my_tbl <- read_json_arrow(tf, col_select = c("hello", "world"), as_data_frame = FALSE)
    complex_cols <- read_json_arrow(tf, col_select = "yo", as_data_frame = FALSE)
    
    # subselect the "yo" column - this is an Arrow ChunkedArray object 
    # containing a Struct at position 0
    yo_col <- complex_cols[["yo"]]
    yo_col
    #> ChunkedArray
    #> <struct<param1: string, param2: string>>
    #> [
    #>   -- is_valid: all not null
    #>   -- child 0 type: string
    #>     [
    #>       "duck1",
    #>       "duck3",
    #>       "duck5"
    #>     ]
    #>   -- child 1 type: string
    #>     [
    #>       "duck2",
    #>       "duck4",
    #>       "duck6"
    #>     ]
    #> ]
    
    # extract the Struct by passing in the chunk number
    sa <- yo_col$chunk(0)
    sa
    #> StructArray
    #> <struct<param1: string, param2: string>>
    #> -- is_valid: all not null
    #> -- child 0 type: string
    #>   [
    #>     "duck1",
    #>     "duck3",
    #>     "duck5"
    #>   ]
    #> -- child 1 type: string
    #>   [
    #>     "duck2",
    #>     "duck4",
    #>     "duck6"
    #>   ]
    
    # extract the "param1" column from the Struct
    param1_col <- sa[["param1"]]
    param1_col
    #> Array
    #> <string>
    #> [
    #>   "duck1",
    #>   "duck3",
    #>   "duck5"
    #> ]
    
    # Add the param1 column to the original Table
    my_tbl[["param1"]] = param1_col
    my_tbl
    #> Table
    #> 3 rows x 3 columns
    #> $hello <double>
    #> $world <bool>
    #> $param1 <string>
    
    # now pull the table into R
    dplyr::collect(my_tbl)
    #> # A tibble: 3 × 3
    #>   hello world param1
    #>   <dbl> <lgl> <chr> 
    #> 1  3.5  FALSE duck1 
    #> 2  3.25 NA    duck3 
    #> 3  0    TRUE  duck5
    

    I was looking for how this is done directly in the tidyverse (we've modelled a lot of the arrow package design after tidyverse design), but many solutions I've seen involve running purrr::map() inside dplyr::select(), a workflow which isn't currently implemented in arrow, and I don't know if it's even possible. Feel free to open a ticket on the repo though if you do want to make a feature request.

    A final note: this probably doesn't make that much difference to memory footprint in the example above, but if you've got lots of nested items to pull out and reassemble into a table, then you'll likely see more benefits.