Search code examples
rdplyrrjson

How to loop over JSON array in a JSON object


I've been trying to learn R and I have a JSON file full of single line JSON objects, and each object has an array of account data. What I'm trying to do is parse each row, then get the JSON array out of the parsed JSON object, pull the account type and the amount. But my problem is that I don't know how best to pull just those two attributes out.

I've tried using the dplyr package to pull "accountHistory" out of each of my JSON lines, but I get a console error. When I try:

select(JsonAcctData, "accountHistory.type", "accountHistory.amount")

What happens is, my code only returns the last account for each row's type and amount.

Right now my code is writing to a csv file and I can see all the data I need, but I just want to remove the ext

library("rjson")
library("dplyr")

parseJsonData <- function (sourceFile, outputFile) 
{
  #Get all total lines in the source file provided
  totalLines <- readLines(sourceFile)

  #Clean up old output file
  if(file.exists(outputFile)){
    file.remove(outputFile)
  }

  #Loop over each line in the sourceFile, 
  #parse the JSON and append to DataFrame
  JsonAcctData <- NULL
  for(i in 1:length(totalLines)){
    jsonValue <- fromJSON(totalLines[[i]])
    frame <- data.frame(jsonValue)
    JsonAcctData <- rbind(JsonAcctData, frame)
  }

  #Try to get filtered data
  filteredColumns <- 
    select(JsonAcctData, "accountHistory.type", "accountHistory.amount")
  print(filteredColumns)

  #Write the DataFrame to the output file in CSV format
  write.csv(JsonAcctData, file = outputFile)

}

Test JSON File Data:

{"name":"Test1", "accountHistory":[{"amount":"107.62","date":"2012-02- 
  02T06:00:00.000Z","business":"CompanyA","name":"Home Loan Account 
  6220","type":"payment","account":"11111111"}, 
  {"amount":"650.88","date":"2012-02- 
  02T06:00:00.000Z","business":"CompanyF","name":"Checking Account 
  9001","type":"payment","account":"123123123"}, 
  {"amount":"878.63","date":"2012-02- 
  02T06:00:00.000Z","business":"CompanyG","name":"Money Market Account 
  8743","type":"deposit","account":"123123123"}]}
  {"name":"Test2", "accountHistory":[{"amount":"199.29","date":"2012-02-            
  02T06:00:00.000Z","business":"CompanyB","name":"Savings Account 
  3580","type":"invoice","account":"12312312"}, 
  {"amount":"841.48","date":"2012-02- 
  02T06:00:00.000Z","business":"Company","name":"Home Loan Account 
  5988","type":"payment","account":"123123123"}, 
  {"amount":"116.55","date":"2012-02- 
  02T06:00:00.000Z","business":"Company","name":"Auto Loan Account 
  1794","type":"withdrawal","account":"12312313"}]}

what I would expect is to get a csv that has just the account types and the amounts held in each account.


Solution

  • Here is a way using regex (in base R)

    # read json 
    json <- readLines('test.json', warn = FALSE)
    # extract with regex
    amount <- grep('\"amount\":\"\\d+\\.\\d+\"', json, value = TRUE)
    amount <- as.numeric(gsub('.*amount\":\"(\\d+\\.+\\d+)\".*', '\\1', amount, perl = TRUE))
    type   <- grep('\"type\":\"\\w+\"', json, value = TRUE)
    type   <- gsub('.*type\":\"(\\w+)\".*', '\\1', type, perl = TRUE)
    # output
    data.frame(type, amount)
    #         type amount
    # 1    payment 107.62
    # 2    payment 650.88
    # 3    deposit 878.63
    # 4    invoice 199.29
    # 5    payment 841.48
    # 6 withdrawal 116.55