Search code examples
rgsubstrsplit

Using strsplit when required split character vector is not consistent for all observations in variable (R)


I have data that looks like the following:

   duration                       obs   another
 1 1.801760     ID: 10 DAY: 6/10/13 S    orange
 2 1.868500     ID: 10 DAY: 6/10/13 S     green
 3 0.233562     ID: 10 DAY: 6/10/13 S    yellow
 4 5.538760       ID:96 DAY: 6/8/13 T    yellow
 5 3.436700       ID:96 DAY: 6/8/13 T      blue
 6 0.533856       ID:96 DAY: 6/8/13 T      pink
 7 2.302250       ID:96 DAY: 6/8/13 T    orange
 8 2.779420       ID:96 DAY: 6/8/13 T     green

I have included just 3 variables, though in reality my data has many. My issue is to do with the ugly looking "obs" variable. I received these data from another individual who entered this information inconsistently into the software that they were using.

'obs' contains three pieces of information: - id (ID: 10, ID:96, etc.) - date (M/D/Y) - identifier (S or T)

I want to split this information and extract the ID number (10 or 96), the date (e.g. 6/8/13) and the identifier (S or T).

To do this I tried the following using strsplit:

temp<-strsplit(as.character(df$obs), " ")
mat<-matrix(unlist(temp), ncol=5, byrow=TRUE)

I thought this would work as in my real data, I have >130,000 observations and I did not realize that some observations had the issue whereby the id did not have a whitespace " " between the "ID:" and the number. In the data above for instance, "ID:96" does not have white space between the colon and the number. Obviously, I got this warning message:

Warning message:
  In matrix(unlist(temp), ncol = 5, byrow = TRUE) :
  data length [796454] is not a sub-multiple or multiple of the number of rows [159291]

Clearly, the strsplit cannot be coerced into nice regular columns as the output of the strsplit takes two forms:

[1] "ID:"     "10"      "DAY:"    "6/10/13" "S"   #when there is whitespace
[1] "ID:96"  "DAY:"   "6/8/13" "T"   #when there isn't whitespace

To try and get round this I did this, thinking that if I could introduce any space after the 'ID:' it could work:

df$obs <- gsub("ID:", "ID: ", df$obs)

But this didn't work as when I then did the strsplit, it would recognize the double whitespace as two places to split the data.

If anybody knows of a solution for multiple strsplits that can then be coerced back into the original df with separate columns for the idnumber, date, identifier, that would be great.

edit: sorry, forgot to add the data for a reproducible example:

df<-structure(list(duration = c(1.80176, 1.8685, 0.233562, 5.53876, 
                        3.4367, 0.533856, 2.30225, 2.77942), obs = structure(c(1L, 1L, 
                                                                               1L, 2L, 2L, 2L, 2L, 2L), .Label = c("ID: 10 DAY: 6/10/13 S", 
                                                                                                                   "ID:96 DAY: 6/8/13 T"), class = "factor"), another = structure(c(3L, 
                                                                                                                                                                                    2L, 5L, 5L, 1L, 4L, 3L, 2L), .Label = c("blue", "green", "orange", 
                                                                                                                                                                                                                            "pink", "yellow"), class = "factor")), .Names = c("duration", 
                                                                                                                                                                                                                                                                              "obs", "another"), class = "data.frame", row.names = c(NA, -8L
                                                                                                                                                                                                                                                                              ))

Solution

  • After you fire that data entry person, I might consider a regular expression here to capture the data. First, here's just the data in the "obs" column (adding the additional value form your comment)

    obs<-c("ID: 10 DAY: 6/10/13 S", "ID: 10 DAY: 6/10/13 S", "ID: 10 DAY: 6/10/13 S", 
    "ID:96 DAY: 6/8/13 T", "ID:96 DAY: 6/8/13 T", "ID:96 DAY: 6/8/13 T", 
    "ID:96 DAY: 6/8/13 T", "ID:96 DAY: 6/8/13 T", "ID: 84DAY: 6/8/13 T")
    

    Next, I can capture the data with

    m<-regexpr("ID:\\s*(\\d+) ?DAY: (\\d+/\\d+/\\d+) (S|T)", obs, perl=T)
    

    Next, I use a helper function regcapturedmatches() to extract the captured matches (it works like regmatches() but for capture groups)

    do.call(rbind, regcapturedmatches(obs,m))
    
    #      [,1] [,2]      [,3]
    # [1,] "10" "6/10/13" "S" 
    # [2,] "10" "6/10/13" "S" 
    # [3,] "10" "6/10/13" "S" 
    # [4,] "96" "6/8/13"  "T" 
    # [5,] "96" "6/8/13"  "T" 
    # [6,] "96" "6/8/13"  "T" 
    # [7,] "96" "6/8/13"  "T" 
    # [8,] "96" "6/8/13"  "T" 
    # [9,] "84" "6/8/13"  "T"
    

    This returns a matrix of values. You can then process those character values how ever you like. You can convert them to the correct class and attach to your data.frame.

    But if you did want to use a strsplit, you could split on either a ":" or spaces with an options preceding ":"

    do.call(rbind, strsplit(obs,"(:|:?\\s+)", obs))
    
    #      [,1] [,2]    [,3]     [,4]      [,5]
    # [1,] "ID" "10"    "DAY"    "6/10/13" "S" 
    # [2,] "ID" "10"    "DAY"    "6/10/13" "S" 
    # [3,] "ID" "10"    "DAY"    "6/10/13" "S" 
    # [4,] "ID" "96"    "DAY"    "6/8/13"  "T" 
    # [5,] "ID" "96"    "DAY"    "6/8/13"  "T" 
    # [6,] "ID" "96"    "DAY"    "6/8/13"  "T" 
    # [7,] "ID" "96"    "DAY"    "6/8/13"  "T" 
    # [8,] "ID" "96"    "DAY"    "6/8/13"  "T" 
    # [9,] "ID" "84DAY" "6/8/13" "T"       "ID"
    

    which works up till your latest new line of bad data.