
Recursive JSON tree index extraction in R


I'm having difficulties extracting the structure of a JSON tree in R.

Consider the following scenario (extracting data from Reddit.com):

library(rjson)
URL = "http://www.reddit.com/r/newzealand/comments/3p25qy/where_can_i_get_a_chromecast_2/"

X = paste0(gsub("\\?ref=search_posts$","",URL),".json?limit=500")
raw_data = fromJSON(readLines(X, warn = FALSE))
main.node = raw_data[[2]]$data$children
replies = main.node[[2]]$data$replies
node = replies$data$children

Now main.node[[1]] holds the attributes of the first comment on the page, while replies holds the replies to the second comment (main.node[[2]]). We can reach these replies through replies$data$children. But a reply can itself be nested inside another reply, so to collect them all we need to walk the tree recursively.
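
To make the shape of the problem concrete without hitting the network, here is a minimal sketch on a hand-built list that mimics the nesting described above. The helpers make_node and count_comments and the sample bodies are hypothetical. The sketch also assumes that a comment without replies carries a non-list replies value (an empty string here), so check that against the real data.

```r
# Hypothetical helper: build one comment node shaped like node$data$...
# Assumption: a node without replies has a non-list `replies` value ("").
make_node <- function(body, children = NULL) {
  replies <- if (is.null(children)) "" else list(data = list(children = children))
  list(data = list(body = body, replies = replies))
}

# Mock thread: two top-level comments, the first with nested replies.
mock_thread <- list(
  make_node("a", list(
    make_node("a1"),
    make_node("a2", list(make_node("a2x"), make_node("a2y")))
  )),
  make_node("b", list(make_node("b1")))
)

# Count a node plus all of its descendants by recursing into its replies.
count_comments <- function(node) {
  replies  <- node$data$replies
  children <- if (is.list(replies)) replies$data$children else list()
  1 + sum(vapply(children, count_comments, numeric(1)))
}

total <- sum(vapply(mock_thread, count_comments, numeric(1)))
print(total)
```

The recursion visits every reply at every depth, which a single pass over replies$data$children cannot do.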

The table below shows the structure of the comments together with the output I'm trying to get:

row  comment | reply_to_comment | reply_to_reply | desired_output
1)   *       |                  |                | 1
2)           | *                |                | 1.1
3)           | *                |                | 1.2
4)           |                  | *              | 1.2.1
5)           |                  | *              | 1.2.2
6)           | *                |                | 1.3
7)   *       |                  |                | 2
8)           | *                |                | 2.1
9)           |                  | *              | 2.1.1

The closest I have come so far is the code below:

node = main.node
reply_function = function(node){
  struct   = seq_along(node) 
  replies  = node$data$replies
  rep.node = if (is.list(replies)) replies$data$children else NULL
  return(list(struct,lapply(rep.node,function(x) reply_function(x))))
}
[1]  1 2 1 2 1 2 1 2 1 2 1 2 1 2

Note that the numbers may change if you rerun it -- this data is dynamic.

However, this approach does not keep the history of the thread: it only tells us how many replies a given node may have, regardless of whether that node is an original comment or a reply to a reply.

If anyone has suggestions on how to build the full hierarchical index (1, 1.1, 1.2.1, ...), please let me know.

Many thanks!


Solution

  • Here's a method using a modified version of a previously answered rjson reader.

    First, we can modify the previous recursive reader to keep count of what level it is on:

    get.comments <- function(node, depth=0) {
      if(is.null(node)) {return(list())}
      comment     <- node$data$body
      replies     <- node$data$replies
      reply.nodes <- if (is.list(replies)) replies$data$children else NULL
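      # The second argument carries the index path ("1", "1.2", ...), not a numeric depth.
      # seq_along() yields an empty sequence when there are no replies, unlike 1:length().
      return(list(paste0(comment, " ", depth),
                  lapply(seq_along(reply.nodes), function(x)
                    get.comments(reply.nodes[[x]], paste0(depth, ".", x)))))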
    }
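
Since the live thread changes over time, it can help to try the reader on a small hand-built tree first. The following is a sketch under stated assumptions: make_node, mock_thread, mock_txt and the comment bodies are hypothetical, and leaf comments are assumed to carry an empty string in replies. The mock thread is shaped like the table in the question.

```r
# Hypothetical helper for building mock nodes (not part of rjson or Reddit).
make_node <- function(body, children = NULL) {
  replies <- if (is.null(children)) "" else list(data = list(children = children))
  list(data = list(body = body, replies = replies))
}

# Same idea as the reader above: the second argument carries the index path.
get.comments <- function(node, depth = 0) {
  if (is.null(node)) return(list())
  comment     <- node$data$body
  replies     <- node$data$replies
  reply.nodes <- if (is.list(replies)) replies$data$children else NULL
  list(paste0(comment, " ", depth),
       lapply(seq_along(reply.nodes), function(x)
         get.comments(reply.nodes[[x]], paste0(depth, ".", x))))
}

# Two top-level comments with nested replies, as in the question's table.
mock_thread <- list(
  make_node("first", list(
    make_node("reply one"),
    make_node("reply two", list(make_node("nested a"), make_node("nested b"))),
    make_node("reply three")
  )),
  make_node("second", list(make_node("reply", list(make_node("nested")))))
)

# Each top-level comment starts its own path at its position in the thread.
mock_txt <- unlist(lapply(seq_along(mock_thread),
                          function(x) get.comments(mock_thread[[x]], x)))
print(mock_txt)
```

Each element of mock_txt ends with its index path, in depth-first order, so a parent always appears before its replies.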
    

    Now read your data in:

    library(rjson)
    URL = "http://www.reddit.com/r/newzealand/comments/3p25qy/where_can_i_get_a_chromecast_2/"
    X = paste0(gsub("\\?ref=search_posts$","",URL),".json?limit=500")
    rawdat    <- fromJSON(readLines(X, warn = FALSE))
    main.node <- rawdat[[2]]$data$children
    

    Then apply the function recursively and unlist:

    txt <- unlist(lapply(seq_along(main.node), function(x) get.comments(main.node[[x]], x)))
    

    Now txt is a character vector of the comments, each with its index appended after a final space. For example:

    "Holy fuck, thank you! Didn't realise this was actually a thing.\n\nfreeeedom 1.1" 
    

    We can split each string at its last space to get a data.frame with the comment text in V1 and the index in V2 (only V2 is shown below):

    z<-as.data.frame(do.call(rbind, strsplit(txt, ' (?=[^ ]+$)', perl = TRUE)))
    
           V2
    1       1
    2     1.1
    3   1.1.1
    4 1.1.1.1
    5       2
    6       3
    7       4
    8     4.1
    9     4.2
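
The split step can be checked in isolation. Below is a minimal sketch on made-up strings in the same "comment text, space, index" form. The sample strings, the column names comment and index, and the extra depth column are illustrative additions, not part of the answer above.

```r
# Hypothetical sample strings in the "comment text + space + index" form.
txt <- c("first comment 1",
         "a reply with\n\ntwo paragraphs 1.1",
         "deeper reply 1.1.1")

# ' (?=[^ ]+$)' matches a space only when everything after it, up to the end
# of the string, contains no further space: that is, the last space.
parts <- strsplit(txt, " (?=[^ ]+$)", perl = TRUE)

# One row per comment: text in the first column, index path in the second.
z <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(z) <- c("comment", "index")

# Nesting depth follows from the number of dot-separated parts in the index.
z$depth <- lengths(strsplit(z$index, ".", fixed = TRUE))
print(z[, c("index", "depth")])
```

Because only the last space is used as the separator, spaces and newlines inside the comment text are left alone.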