I have thousands of very small JSON files in a directory.
Right now, I am using the following code to load them:
library(dplyr)
library(jsonlite)
library(purrr)
filelistjson <- list.files(DATA_DIRECTORY, full.names = TRUE, recursive = TRUE)
filelistjson %>% map(., ~fromJSON(file(.x)))
Unfortunately, this is extremely slow (I also tried furrr::future_map). I wonder if there is a better approach here. The individual files are barely 25KB in size...
The files look like the following, with a couple of nested variables, but nothing too complicated:
{
  "field1": "hello world",
  "funny": "yes",
  "date": "abc1234",
  "field3": "hakuna matata",
  "nestedvar": [
    "http://www.stackoverflow.com",
    "http://www.stackoverflow.com/funny"
  ],
  "othernested": [
    {
      "one": "two",
      "test": "hello"
    }
  ]
}
Thanks!
There are several JSON libraries in R. Here are benchmarks for three of them:
txt <- '{
  "field1": "hello world",
  "funny": "yes",
  "date": "abc1234",
  "field3": "hakuna matata",
  "nestedvar": [
    "http://www.stackoverflow.com",
    "http://www.stackoverflow.com/funny"
  ],
  "othernested": [
    {
      "one" : "two",
      "test" : "hello"
    }
  ]
}'
microbenchmark::microbenchmark(
  jsonlite = {
    jsonlite::fromJSON(txt)
  },
  RJSONIO = {
    RJSONIO::fromJSON(txt)
  },
  rjson = {
    rjson::fromJSON(txt)
  }
)
# Unit: microseconds
#      expr     min       lq      mean  median      uq     max neval cld
#  jsonlite 144.047 153.3455 173.92028 167.021 172.491 456.935   100   c
#   RJSONIO 113.049 120.3420 134.94045 128.365 132.742 287.727   100   b
#     rjson  10.211  12.4000  17.10741  17.140  18.234  59.807   100   a
As you can see, rjson seems to be the most efficient here (though treat the results above with caution). Personally, I like working with RJSONIO, as in my experience it is the library that best preserves the original format when reading, modifying, and re-serializing JSON.
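If you do want to try rjson, a minimal sketch of the loop from your question with rjson swapped in (assuming the same filelistjson vector of paths) could look like this:
library(purrr)

# Same file listing as in the question
filelistjson <- list.files(DATA_DIRECTORY, full.names = TRUE, recursive = TRUE)

# rjson::fromJSON(file = ...) reads and parses each file in one call
parsed <- map(filelistjson, ~ rjson::fromJSON(file = .x))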
Finally, if you know the (invariant) structure of your files, you can always build a custom JSON reader and perhaps squeeze out more efficiency. But as @Gregor pointed out, you should first make sure the latency really comes from the reader.
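One rough way to check that (a sketch on my part, not something from your post): time just reading the raw files against reading plus parsing them, e.g. on a sample of your paths:
microbenchmark::microbenchmark(
  read_only  = lapply(filelistjson, readLines, warn = FALSE),
  read_parse = lapply(filelistjson, function(p) rjson::fromJSON(file = p)),
  times = 5
)
If read_only already accounts for most of the time, the bottleneck is the I/O of opening thousands of small files rather than the JSON parser, and switching libraries will not buy you much.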