I've been trying to learn R and I have a JSON file full of single line JSON objects, and each object has an array of account data. What I'm trying to do is parse each row, then get the JSON array out of the parsed JSON object, pull the account type and the amount. But my problem is that I don't know how best to pull just those two attributes out.
I've tried using the dplyr package to pull "accountHistory" out of each of my JSON lines, but I get a console error. When I try:
select(JsonAcctData, "accountHistory.type", "accountHistory.amount")
What happens is, my code only returns the last account for each row's type and amount.
Right now my code is writing to a csv file and I can see all the data I need, but I just want to remove the ext
library("rjson")
library("dplyr")
parseJsonData <- function (sourceFile, outputFile)
{
#Get all total lines in the source file provided
totalLines <- readLines(sourceFile)
#Clean up old output file
if(file.exists(outputFile)){
file.remove(outputFile)
}
#Loop over each line in the sourceFile,
#parse the JSON and append to DataFrame
JsonAcctData <- NULL
for(i in 1:length(totalLines)){
jsonValue <- fromJSON(totalLines[[i]])
frame <- data.frame(jsonValue)
JsonAcctData <- rbind(JsonAcctData, frame)
}
#Try to get filtered data
filteredColumns <-
select(JsonAcctData, "accountHistory.type", "accountHistory.amount")
print(filteredColumns)
#Write the DataFrame to the output file in CSV format
write.csv(JsonAcctData, file = outputFile)
}
Test JSON File Data:
{"name":"Test1", "accountHistory":[{"amount":"107.62","date":"2012-02-
02T06:00:00.000Z","business":"CompanyA","name":"Home Loan Account
6220","type":"payment","account":"11111111"},
{"amount":"650.88","date":"2012-02-
02T06:00:00.000Z","business":"CompanyF","name":"Checking Account
9001","type":"payment","account":"123123123"},
{"amount":"878.63","date":"2012-02-
02T06:00:00.000Z","business":"CompanyG","name":"Money Market Account
8743","type":"deposit","account":"123123123"}]}
{"name":"Test2", "accountHistory":[{"amount":"199.29","date":"2012-02-
02T06:00:00.000Z","business":"CompanyB","name":"Savings Account
3580","type":"invoice","account":"12312312"},
{"amount":"841.48","date":"2012-02-
02T06:00:00.000Z","business":"Company","name":"Home Loan Account
5988","type":"payment","account":"123123123"},
{"amount":"116.55","date":"2012-02-
02T06:00:00.000Z","business":"Company","name":"Auto Loan Account
1794","type":"withdrawal","account":"12312313"}]}
what I would expect is to get a csv that has just the account types and the amounts held in each account.
Here is a way using regex
(in base R
)
# read json
json <- readLines('test.json', warn = FALSE)
# extract with regex
amount <- grep('\"amount\":\"\\d+\\.\\d+\"', json, value = TRUE)
amount <- as.numeric(gsub('.*amount\":\"(\\d+\\.+\\d+)\".*', '\\1', amount, perl = TRUE))
type <- grep('\"type\":\"\\w+\"', json, value = TRUE)
type <- gsub('.*type\":\"(\\w+)\".*', '\\1', type, perl = TRUE)
# output
data.frame(type, amount)
# type amount
# 1 payment 107.62
# 2 payment 650.88
# 3 deposit 878.63
# 4 invoice 199.29
# 5 payment 841.48
# 6 withdrawal 116.55