Search code examples
rdplyrsubstring

Subsetting from serialized string in R


I get a field from a SQL database in which various information is stored as a serialized object. I would like to extract the values ​​from this string in R and save them as separate columns.

Here is a simplified example:

df <- data.frame(ID = c(1,2,3),
                 information = c("{s:4:\"name\";s:3:\"max\";s:3:\"age\";s:1:\"8\";}",
                                 "{s:4:\"name\";s:5:\"peter\";s:3:\"age\";s:2:\"10\";}",
                                 "{s:4:\"name\";s:4:\"susy\";s:3:\"age\";s:3:\"100\";}") 
                 )

df <- df %>% 
  mutate(
    name = substring(information, gregexpr('"', information)[[1]][3] +1, gregexpr('"', information)[[1]][4]-1)
  )

df

I tried to select the value between the third and fourth quotes. Unfortunately, this selection only works for the first value. For the other selections, the positions are obviously taken from the first value.

[1] "max" "pet" "sus"

Thank you for your ideas


Solution

  • Split on delimiter and read as delimited text, keep columns we need:

    read.delim(text = gsub(";", ":", df$information, fixed = TRUE),
               sep = ":", header = FALSE)[, c(3, 6, 9, 12)]
    #     V3    V6  V9 V12
    # 1 name   max age   8
    # 2 name peter age  10
    # 3 name  susy age 100
    

    Note: there is a pattern seq(3, 12, 3) to automate the column selection.