Search code examples
rsalesforce-lightningjsonlite

How to deal with nested empty JSON arrays with tidyjson in R


I am reading a JSON object from Salesforce. The object is irregular in the sense that some nested arrays are empty and some are not. How to deal with this in tidyjson?

I am setting up an API with Salesforce in R. The objective is to get meaningful data out of Salesforce to process in R.

json <- '
{
  "totalSize": [
    355710
  ],
  "done": [
    false
  ],
  "nextRecordsUrl": [
    "/services/data/v45.0/query/01gc000001L8zdkAAB-749"
  ],
  "records": [
    {
      "attributes": {
        "type": "Order_Line__c",
        "url": "/services/data/v45.0/sobjects/Order_Line__c/a0T1N000009aZ9lUAE"
      },
      "Id": "a0T1N000009aZ9lUAE",
      "Name": "OrderLine-1099369",
      "SO_Number_Formula__c": "548402-2.3",
      "Ship_From_Inventory__c": "XXX",
      "RMA_Number__c": "548402",
      "Part_Number__c": "01t1N00000JNeAQQA1",
      "Marketing_Part__c": "XXXXXXXXXXX",
      "Family__c": "XXXXXXXX",
      "Serial_Numbers__r": {
        "records": {}
      }
    },
    {
      "attributes": {
        "type": "Order_Line__c",
        "url": "/services/data/v45.0/sobjects/Order_Line__c/a0T1N000009aZ9mUAE"
      },
      "Id": "a0T1N000009aZ9mUAE",
      "Name": "OrderLine-1099370",
      "SO_Number_Formula__c": "962816-1.1",
      "Ship_From_Inventory__c": "XXX",
      "RMA_Number__c": "962816",
      "Part_Number__c": "01t1N00000JNc3qQAD",
      "Marketing_Part__c": "XXXXXXXXXX",
      "Family__c": "XXXXXXX",
      "RMA_Received_Date__c": "2019-02-18",
      "Serial_Numbers__r": {
        "totalSize": 1,
        "done": true,
        "records": [
          {
            "attributes": {
              "type": "Serial_Number__c",
              "url": "/services/data/v45.0/sobjects/Serial_Number__c/a0X1N00000NoyAjUAJ"
            },
            "Id": "a0X1N00000NoyAjUAJ",
            "Name": "SN217426",
            "Legacy_Line_Id__c": "962816SN217426",
            "Customer_Name__c": "XXXXXX",
            "Original_Shipment_Date__c": "2018-06-26",
            "Disposition__c": "Pending",
            "Status__c": "FailureVerification"
          }
        ]
      }
    }
  ]
}
'

mydata <- json %>% 
    as.tbl_json %>%
    enter_object("records") %>%
    gather_array() %>%
    spread_values(
      Id = jstring("Id"),
      Name = jstring("Name"),
      SO_Number_Formula = jstring("SO_Number_Formula__c"),
      Ship_From_Inventory = jstring("Ship_From_Inventory__c"),
      RMA_Number = jstring("RMA_Number__c"),
      Part_Number = jstring("Part_Number__c"),
      Marketing_Part = jstring("Marketing_Part__c"),
      Family = jstring("Family__c")) %>%
    enter_object("Serial_Numbers__r") %>%
    enter_object("records") %>%
    gather_ %>%
      spread_values(
    Id = jstring("Id"))

The irregularity is in the [records][Serial_Numbers__r][records]. In this example, the first occurrence is empty {} and the second occurrence is not empty. The code generates the following error when executing gather_keys or gather _array: Error in gather_keys(.) : 1 records are values not objects Error in gather_array(.) : 1 records are values not arrays

I am thinking that this is caused by the empty array [records]. There is plenty of such irregularity in the Salesforce output: some records have detailed nested data, some do not. How can I deal with this?


Solution

  • This is a fantastic question, and is something that we should really have a cleaner way to handle. enter_object() proves problematic in these types of cases where you lose records based on irregular JSON practices.

    I submitted an issue to track improvements here: https://github.com/colearendt/tidyjson/issues/121

    In the meantime, the way I typically do this is by splitting records based on the characteristic that delineates them. In this case, you can use gather_object() on the parent object to get the same effect as enter_object(), and then use filter / bind_rows to treat the rows differently.

    Ideally bind_rows() would work better in the pipe here... that is something I would like to see as an improvement for dplyr (Issue here)! I'm curious to see if this solves your problem! (Also, keep spread_all() in mind to simplify some of the column specifying at a cost of some "guessing" on the package's part!).

      json <- '{
      "totalSize": [
        355710
      ],
      "done": [
        false
      ],
      "nextRecordsUrl": [
        "/services/data/v45.0/query/01gc000001L8zdkAAB-749"
      ],
      "records": [
        {
          "attributes": {
            "type": "Order_Line__c",
            "url": "/services/data/v45.0/sobjects/Order_Line__c/a0T1N000009aZ9lUAE"
          },
          "Id": "a0T1N000009aZ9lUAE",
          "Name": "OrderLine-1099369",
          "SO_Number_Formula__c": "548402-2.3",
          "Ship_From_Inventory__c": "XXX",
          "RMA_Number__c": "548402",
          "Part_Number__c": "01t1N00000JNeAQQA1",
          "Marketing_Part__c": "XXXXXXXXXXX",
          "Family__c": "XXXXXXXX",
          "Serial_Numbers__r": {
            "records": {}
          }
        },
        {
          "attributes": {
            "type": "Order_Line__c",
            "url": "/services/data/v45.0/sobjects/Order_Line__c/a0T1N000009aZ9mUAE"
          },
          "Id": "a0T1N000009aZ9mUAE",
          "Name": "OrderLine-1099370",
          "SO_Number_Formula__c": "962816-1.1",
          "Ship_From_Inventory__c": "XXX",
          "RMA_Number__c": "962816",
          "Part_Number__c": "01t1N00000JNc3qQAD",
          "Marketing_Part__c": "XXXXXXXXXX",
          "Family__c": "XXXXXXX",
          "RMA_Received_Date__c": "2019-02-18",
          "Serial_Numbers__r": {
            "totalSize": 1,
            "done": true,
            "records": [
              {
                "attributes": {
                  "type": "Serial_Number__c",
                  "url": "/services/data/v45.0/sobjects/Serial_Number__c/a0X1N00000NoyAjUAJ"
                },
                "Id": "a0X1N00000NoyAjUAJ",
                "Name": "SN217426",
                "Legacy_Line_Id__c": "962816SN217426",
                "Customer_Name__c": "XXXXXX",
                "Original_Shipment_Date__c": "2018-06-26",
                "Disposition__c": "Pending",
                "Status__c": "FailureVerification"
              }
            ]
          }
        }
      ]
    }
    '
    
      library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
      library(tidyr)
      library(tidyjson)
    #> 
    #> Attaching package: 'tidyjson'
    #> The following object is masked from 'package:dplyr':
    #> 
    #>     bind_rows
    #> The following object is masked from 'package:stats':
    #> 
    #>     filter
    
      prep_data <- json %>%
        as.tbl_json %>%
        enter_object("records") %>%
        gather_array() %>%
        spread_values(
          Id = jstring("Id"),
          Name = jstring("Name"),
          SO_Number_Formula = jstring("SO_Number_Formula__c"),
          Ship_From_Inventory = jstring("Ship_From_Inventory__c"),
          RMA_Number = jstring("RMA_Number__c"),
          Part_Number = jstring("Part_Number__c"),
          Marketing_Part = jstring("Marketing_Part__c"),
          Family = jstring("Family__c")) %>%
        enter_object("Serial_Numbers__r")
    
      # show that types are different
      prep_data %>%
        gather_object("key") %>%
        json_types() %>%
        select(key, type) %>%
        filter(key == "records")
    #> # A tbl_json: 2 x 2 tibble with a "JSON" attribute
    #>   `attr(., "JSON")`      key     type  
    #>   <chr>                  <chr>   <fct> 
    #> 1 "{}"                   records object
    #> 2 "[{\"attributes\":..." records array
    
      # handle
      taller <- prep_data %>%
        gather_object("key") %>%
        json_types("type") %>%
        filter(key == "records")
    
      final <- tidyjson::bind_rows(
        taller %>% filter(type == "object"),
        taller %>% filter(type == "array") %>%
          gather_array("record_row") %>%
          spread_values(
            RecordId = jstring("Id")
          )
      )
    
      final %>% select(key, type, record_row, RecordId)
    #> # A tbl_json: 2 x 4 tibble with a "JSON" attribute
    #>   `attr(., "JSON")`      key     type   record_row RecordId          
    #>   <chr>                  <chr>   <fct>       <int> <chr>             
    #> 1 "{}"                   records object         NA <NA>              
    #> 2 "{\"attributes\":{..." records array           1 a0X1N00000NoyAjUAJ
    

    Created on 2020-03-15 by the reprex package (v0.3.0)