
R: regex from string to two dimensional data frame in one command?


I have a string s containing key-value pairs, and I would like to construct a data frame from it:

s="{'#JJ': 121, '#NN': 938, '#DT': 184, '#VB': 338, '#RB': 52}"
r1<-sapply(strsplit(s, "[^0-9_]+"),as.numeric)
r2<-sapply(strsplit(s, "[^A-Z]+"),as.character)
d<-data.frame(id=r2,value=r1)

which gives:

r1
     [,1]
[1,]   NA
[2,]  121
[3,]  938
[4,]  184
[5,]  338
[6,]   52
 r2
     [,1]
[1,] ""  
[2,] "JJ"
[3,] "NN"
[4,] "DT"
[5,] "VB"
[6,] "RB"

 d
  id value
1       NA
2 JJ   121
3 NN   938
4 DT   184
5 VB   338
6 RB    52

First, I would like to avoid the NA and "" that appear after applying the regular expression. I think it should be something like {2,}, meaning match everything from the second occurrence onward, but I cannot get that to work in R.

Another thing I would like to do: given a data frame with a column like the one below:

                                                              m
1   {'#JJ': 121, '#NN': 938, '#DT': 184, '#VB': 338, '#RB': 52}
2       {'#NN': 168, '#DT': 59, '#VB': 71, '#RB': 5, '#JJ': 35}
3      {'#JJ': 18, '#NN': 100, '#DT': 23, '#VB': 52, '#RB': 11}
4      {'#NN': 156, '#JJ': 39, '#DT': 46, '#VB': 67, '#RB': 21}
5       {'#NN': 112, '#DT': 39, '#VB': 57, '#RB': 8, '#JJ': 32}
6  {'#DT': 236, '#NN': 897, '#VB': 420, '#RB': 122, '#JJ': 240}
7     {'#NN': 316, '#RB': 25, '#DT': 66, '#VB': 112, '#JJ': 81}
8      {'#NN': 198, '#DT': 29, '#VB': 85, '#RB': 37, '#JJ': 44}
9                                                   {'#RB': 30}
10     {'#NN': 373, '#DT': 48, '#VB': 71, '#RB': 21, '#JJ': 36}
11       {'#NN': 49, '#DT': 17, '#VB': 23, '#RB': 11, '#JJ': 8}
12  {'#NN': 807, '#JJ': 135, '#DT': 177, '#VB': 315, '#RB': 69}

I would like to iterate over each row and split its numerical values into columns named by the keys.

Here is an example of a few rows showing how I would like the result to look:

[image: example of the desired output]


Solution

  • I would use something that parses JSON, which is what your data seems to be:

    s <- "{'#JJ': 121, '#NN': 938, '#DT': 184, '#VB': 338, '#RB': 52}"
    
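    # Parse one "{'#KEY': n, ...}" string into a two-column data frame.
    parse.one <- function(s) {
      require(rjson)
      # JSON requires double quotes, so swap the single quotes first
      v <- fromJSON(gsub("'", '"', s))
      data.frame(id = gsub("#", "", names(v)),   # drop the leading "#"
                 value = unlist(v, use.names = FALSE))
    }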
    
    parse.one(s)
    #   id value
    # 1 JJ   121
    # 2 NN   938
    # 3 DT   184
    # 4 VB   338
    # 5 RB    52
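
    If you would rather stay with regular expressions: strsplit returns a leading "" whenever the string starts with a match of the split pattern (here the opening {'# part), which is where the "" and the NA come from. A minimal base R sketch that extracts the matches themselves instead of splitting on everything else, assuming keys are always runs of capital letters and values are runs of digits, as in your example:

    ```r
    s <- "{'#JJ': 121, '#NN': 938, '#DT': 184, '#VB': 338, '#RB': 52}"

    # gregexpr() locates every match; regmatches() pulls out the matched text.
    # Nothing is split, so no empty first element is produced.
    keys   <- regmatches(s, gregexpr("[A-Z]+", s))[[1]]
    values <- as.numeric(regmatches(s, gregexpr("[0-9]+", s))[[1]])

    d <- data.frame(id = keys, value = values)
    d
    ```

    This should give the same five rows as parse.one(s), but it relies on the keys never containing digits and the values never containing letters, so the JSON route is the safer one for messier input.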
    

    For the second part of the question, I would pass a slightly modified version of the parse.one function through lapply, then let plyr's rbind.fill function align the pieces together while filling missing values with NA:

    df <- data.frame(m = c(
      "{'#JJ': 121, '#NN': 938, '#DT': 184, '#VB': 338, '#RB': 52}",
      "{'#NN': 168, '#DT': 59, '#VB': 71, '#RB': 5, '#JJ': 35}",
      "{'#JJ': 18, '#NN': 100, '#DT': 23, '#VB': 52, '#RB': 11}",
      "{'#JJ': 12, '#VB': 5}"
    ))
    
    # Variant that returns a one-row data frame with one column per key.
    parse.one <- function(s) {
      require(rjson)
      y <- fromJSON(gsub("'", '"', s))
      names(y) <- gsub("#", "", names(y))  # "#JJ" -> "JJ"
      as.data.frame(y)
    }
    
    library(plyr)
    rbind.fill(lapply(df$m, parse.one))
    #    JJ  NN  DT  VB RB
    # 1 121 938 184 338 52
    # 2  35 168  59  71  5
    # 3  18 100  23  52 11
    # 4  12  NA  NA   5 NA
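
    If installing rjson and plyr is not an option, the same wide layout can probably be built with base R alone. The sketch below is only an illustration under the same assumptions as your example (capital-letter keys, integer values); parse_counts is a hypothetical helper name, not part of any package:

    ```r
    # Hypothetical helper: turn one "{'#KEY': n, ...}" string into a named numeric vector.
    parse_counts <- function(s) {
      keys   <- regmatches(s, gregexpr("[A-Z]+", s))[[1]]
      values <- as.numeric(regmatches(s, gregexpr("[0-9]+", s))[[1]])
      setNames(values, keys)
    }

    df <- data.frame(m = c(
      "{'#JJ': 121, '#NN': 938, '#DT': 184, '#VB': 338, '#RB': 52}",
      "{'#NN': 168, '#DT': 59, '#VB': 71, '#RB': 5, '#JJ': 35}",
      "{'#RB': 30}"
    ), stringsAsFactors = FALSE)

    parsed   <- lapply(df$m, parse_counts)
    all_keys <- unique(unlist(lapply(parsed, names)))

    # Indexing a named vector by a name it lacks yields NA, which fills the gaps.
    rows <- lapply(parsed, function(v) setNames(v[all_keys], all_keys))
    wide <- as.data.frame(do.call(rbind, rows))
    wide
    ```

    Columns appear in the order the keys are first seen, and a row that lacks a key gets NA in that column, which mirrors what rbind.fill does above.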