Search code examples
rdplyrstringr

How to parse a JSON like structure for every row in a dataframe?


I am trying to parse a a column within my dataframe that resembles a JSON-like structure into a new dataframe. Although, when I run my code, it is only outputting the first row of the dataframe.

How do I alter what I currently have so that it applies to every row within the dataframe? The string isn't always a consistent length... Sometimes there are many more pairs than others.

Sample dataframe:

df <- data.frame(
  col = c(
    "[{guid abc123-def456-bf1239435ahfs} {id <nil>} {start_timestamp 1688573993888} {end_timestamp <nil>} {active true}]",
    "[{guid xyz987-pqr654-lmno0987zxywu} {id <nil>} {start_timestamp 1688574000000} {active false}]"
  ),
  stringsAsFactors = FALSE
)

What I have tried so far:

# Extract key-value pairs from the string
pairs <- str_match_all(df$col, "\\{(.*?)\\s(.*?)\\}")[[1]]

# Convert the key-value pairs to a data frame
new_df <- as.data.frame(pairs[, -1], stringsAsFactors = FALSE)

wide_df <- pivot_wider(new_df, names_from = V1, values_from = V2)

Output: enter image description here


Solution

  • You can use a bit of string parsing using strsplit. This keeps the output from each row in its own list element which you can lapply into a single-row data frame. Once you have all your single-row data frames you can bind them into a single data frame with the same number of rows as the original.

    All this can be done in base R:

    do.call(rbind,
      strsplit(df$col, '\\[\\{|\\} \\{|\\}\\]') |>
      lapply(function(x) strsplit(x[-1], ' ')) |>
      lapply(function(x) t(setNames(sapply(x, `[`, 2), sapply(x, `[`, 1)))) |>
      lapply(as.data.frame)
    )
    #>                          guid    id start_timestamp end_timestamp active
    #> 1 abc123-def456-bf1239435ahfs <nil>   1688573993888         <nil>   true
    #> 2 xyz987-pqr654-lmno0987zxywu <nil>   1688574000000         <nil>   true