Search code examples
rparsingextract

Splitting strings into components


For example, I have a data table with several columns:

column A                      column B
key_500:station and loc       2
spectra:key_600:type          9
alpha:key_100:number          12

I want to split the rows of column A into components and create new columns, guided by the following rules:

  • the value between "key_" and ":" will be var1,
  • the next value after ":" will be var2,
  • the original column A should retain the part of string that is prior to ":key_". If it is empty (as in the first line), then replace "" with an "effect" word.

My expected final data table should be like this one:

column A   column B  var1  var2
effect     2         500   station and loc
spectra    9         600   type
alpha      12        100   number

Solution

  • Using tidyr extract you can extract specific part of the string using regex.

    tidyr::extract(df, columnA, into = c('var1', 'var2'), 'key_(\\d+):(.*)', 
                   convert = TRUE, remove = FALSE) %>%
      dplyr::mutate(columnA = sub(':?key_.*', '', columnA), 
             columnA = replace(columnA, columnA == '', 'effect'))
    
    #  columnA var1            var2 columnB
    #1  effect  500 station and loc       2
    #2 spectra  600            type       9
    #3   alpha  100          number      12
    

    If you want to use data.table you can break this down in steps :

    library(data.table)
    setDT(df)
    df[, c('var1', 'var2') := .(sub('.*key_(\\d+).*', '\\1',columnA), 
                                sub('.*key_\\d+:', '', columnA))]
    df[, columnA := sub(':?key_.*', '', columnA)]
    df[, columnA := replace(columnA, columnA == '', 'effect')]
    

    data

    df <-  structure(list(columnA = c("key_500:station and loc", 
    "spectra:key_600:type", "alpha:key_100:number"), 
    columnB = c(2L, 9L, 12L)), class = "data.frame", row.names = c(NA, -3L))