Search code examples
rjsondata.tablejsonlite

R jsonlite does not parse JSON as per keys correctly


When converting json to columns in R using jsonlite, I came across a bug where the column values were in the wrong columns. I tried replicating it -

R> dt <- data.table(id = 1:2, 
json = c('{"user": "xyz2", "weightmap": "w1", "domains": "d1"}', 
         '{"weightmap": "w2", "user": "abcd", "domains": "d2"}'))                                                                                                                                                                       

R> dt[,fromJSON(json),.(id)]                                                                                                                                                                                                                                                                                                  
   id user weightmap domains
1:  1 xyz2        w1      d1
2:  2   w2      abcd      d2


From what I understand, jsonlite picks the keys from the first row and expects the json in all rows to be in the same order.

I also tried this with different datatypes. Eg. If weightmap was a complex datastructure like a list or another nested json, it would fail saying the columns types are not consistent.

Is this expected behavior ? Should'nt it always read from the keys and split to cols ?

Is there a way to make jsonlite parse the json correctly ?


Solution

  • I wonder if this is a "bug" in data.table, where it is internally treating each as a list instead of a data.frame with names that need to be aligned. We can force it this way:

    dt[, c(.SD, rbindlist(lapply(json, jsonlite::fromJSON), use.names=TRUE))][, json := NULL][]
    #       id   user weightmap domains
    #    <int> <char>    <char>  <char>
    # 1:     1   xyz2        w1      d1
    # 2:     2   abcd        w2      d2
    

    If there is any complexity, what happens depends on the structure. For example, if a weightmap is a list, then

    dt <- data.table(id = 1:2, 
        json = c('{"user": "xyz2", "weightmap": ["w1","w2"], "domains": "d1"}', 
                 '{"weightmap": "w2", "user": "abcd", "domains": "d2"}'))
    dt[, c(.SD, rbindlist(lapply(json, jsonlite::fromJSON), use.names=TRUE))][, json := NULL][]
    # Warning in as.data.table.list(jval, .named = NULL) :
    #   Item 1 has 2 rows but longest item has 3; recycled with remainder.
    # Warning in as.data.table.list(jval, .named = NULL) :
    #   Item 2 has 2 rows but longest item has 3; recycled with remainder.
    #       id   user weightmap domains
    #    <int> <char>    <char>  <char>
    # 1:     1   xyz2        w1      d1
    # 2:     2   xyz2        w2      d1
    # 3:     1   abcd        w2      d2
    

    which achieves an "unnesting" (noisily).