Tags: rjson, purrr, jsonlite, furrr

How to quickly parse many small JSON files?


I have thousands of very small JSON files in a directory.

Right now, I am using the following code to load them:

library(dplyr)
library(jsonlite)
library(purrr)

filelistjson <- list.files(DATA_DIRECTORY, full.names = TRUE, recursive = TRUE)
filelistjson %>% map(~ fromJSON(file(.x)))

Unfortunately, this is extremely slow (I also tried furrr::future_map). I wonder if there is a better approach here, since the individual files are barely 25 KB in size...
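
For reference, the furrr attempt was along these lines (a rough sketch, not verbatim; it assumes a multisession future backend):

library(furrr)
future::plan(future::multisession)

# same per-file parsing as above, spread across parallel workers;
# fromJSON() accepts a file path directly
filelistjson %>% future_map(~ fromJSON(.x))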

The files look like the following, with a couple of nested variables, but nothing too complicated:

{
  "field1": "hello world",
  "funny": "yes",
  "date": "abc1234",
  "field3": "hakuna matata",
  "nestedvar": [
    "http://www.stackoverflow.com",
    "http://www.stackoverflow.com/funny"
  ],
  "othernested": [
    {
      "one": "two",
      "test": "hello"
    }
  ]
}

Thanks!


Solution

  • There are several JSON libraries in R. Here are benchmarks for three of them, run on your sample document:

    txt <- '{
      "field1": "hello world",
      "funny": "yes",
      "date": "abc1234",
      "field3": "hakuna matata",
      "nestedvar": [
        "http://www.stackoverflow.com",
        "http://www.stackoverflow.com/funny"
      ],
      "othernested": [
        {
          "one": "two",
          "test": "hello"
        }
      ]
    }'
    
    microbenchmark::microbenchmark(
      jsonlite={
        jsonlite::fromJSON(txt)
      },
      RJSONIO={
        RJSONIO::fromJSON(txt)
      },
      rjson={
        rjson::fromJSON(txt)
      }
    )
    
    # Unit: microseconds
    #     expr     min       lq      mean  median      uq     max neval cld
    # jsonlite 144.047 153.3455 173.92028 167.021 172.491 456.935   100   c
    #  RJSONIO 113.049 120.3420 134.94045 128.365 132.742 287.727   100  b 
    #    rjson  10.211  12.4000  17.10741  17.140  18.234  59.807   100 a 
    

    As you can see, rjson seems to be the most efficient (though treat these results with caution: they come from parsing a single small string). Personally, I like working with RJSONIO because, in my experience, it is the library that best preserves the original format when reading, modifying, and re-serializing JSON.
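
    If you want to try rjson on the real workload, a minimal sketch of a drop-in replacement for the map() call from the question could look like this (DATA_DIRECTORY and the file listing come from the question; rjson::fromJSON can read a file directly through its file argument):

    library(purrr)
    library(rjson)

    filelistjson <- list.files(DATA_DIRECTORY, full.names = TRUE, recursive = TRUE)

    # parse each file straight from disk; `file =` tells rjson to read the path
    parsed <- purrr::map(filelistjson, ~ rjson::fromJSON(file = .x))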

    Finally, if you know the (invariant) structure of your files, you can always build a custom JSON reader and perhaps squeeze out more speed. But as @Gregor pointed out, first make sure the latency really comes from the parser rather than, say, file I/O.
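
    One rough way to check this (illustrative only, reusing filelistjson from the question) is to time reading the raw text on its own against reading plus parsing:

    # if read_only is already slow, the parser is not your bottleneck
    microbenchmark::microbenchmark(
      read_only  = lapply(filelistjson, function(f) readChar(f, file.size(f))),
      read_parse = lapply(filelistjson, function(f) rjson::fromJSON(readChar(f, file.size(f)))),
      times = 5
    )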