I am trying to parse a a column within my dataframe that resembles a JSON-like structure into a new dataframe. Although, when I run my code, it is only outputting the first row of the dataframe.
How do I alter what I currently have so that it applies to every row within the dataframe? The string isn't always a consistent length... Sometimes there are many more pairs than others.
Sample dataframe:
df <- data.frame(
col = c(
"[{guid abc123-def456-bf1239435ahfs} {id <nil>} {start_timestamp 1688573993888} {end_timestamp <nil>} {active true}]",
"[{guid xyz987-pqr654-lmno0987zxywu} {id <nil>} {start_timestamp 1688574000000} {active false}]"
),
stringsAsFactors = FALSE
)
What I have tried so far:
# Extract key-value pairs from the string
pairs <- str_match_all(df$col, "\\{(.*?)\\s(.*?)\\}")[[1]]
# Convert the key-value pairs to a data frame
new_df <- as.data.frame(pairs[, -1], stringsAsFactors = FALSE)
wide_df <- pivot_wider(new_df, names_from = V1, values_from = V2)
You can use a bit of string parsing using strsplit
. This keeps the output from each row in its own list element which you can lapply
into a single-row data frame. Once you have all your single-row data frames you can bind them into a single data frame with the same number of rows as the original.
All this can be done in base R:
do.call(rbind,
strsplit(df$col, '\\[\\{|\\} \\{|\\}\\]') |>
lapply(function(x) strsplit(x[-1], ' ')) |>
lapply(function(x) t(setNames(sapply(x, `[`, 2), sapply(x, `[`, 1)))) |>
lapply(as.data.frame)
)
#> guid id start_timestamp end_timestamp active
#> 1 abc123-def456-bf1239435ahfs <nil> 1688573993888 <nil> true
#> 2 xyz987-pqr654-lmno0987zxywu <nil> 1688574000000 <nil> true