Search code examples
rdataframedata.tablesubstring

How to extract part of cell value across columns?


I have a data frame like this:

df1<-structure(list(q006_1 = c("1098686880", "18493806","9892464","96193586",
                               "37723803","13925456","37713534","1085246853"),
                    q006_2 = c("1098160170","89009521","9726314","28076230","63451251",
                               "1090421499","37124019"),
                    q006_3 = c("52118967","41915062","1088245358","79277706","91478662",
                               "80048634")), 
               class=data.frame, row.names = c(NA, -8L)))

I know how to extract the last five digits of each number for one column using substr in data.table but I want to do it across all columns.

n_last <- 5  

df1[, `q006_1`:= substr(q006_1, nchar(q006_1) - n_last + 1, nchar(q006_1))]

How can I do this for all columns?


Solution

  • In data.table it can be done like below: (Your sample data was incomplete as the first column had 8, second column had 7 and the third had 6 entries.)

    library(data.table)
    
    #or `cols <- names(df1)` if you want to apply it on all columns and this is not just an example
    cols <- c("q006_1", "q006_2", "q006_3") 
    
    setDT(df1)[ , (cols):= lapply(.SD, function(x){
                                       sub('.*(?=.{5}$)', '', x, perl=T)}),
                 .SDcols = cols][]
    
    #     q006_1 q006_2 q006_3
    # 1:  86880  60170  18967
    # 2:  93806  09521  15062
    # 3:  92464  26314  45358
    # 4:  93586  76230  77706
    # 5:  23803  51251  78662
    # 6:  25456  21499  48634
    # 7:  13534  24019  76230
    # 8:  46853  76230  76230
    

    Data:

    df1<-structure(list(q006_1 = c("1098686880", "18493806","9892464","96193586",
                                   "37723803","13925456","37713534","1085246853"),
                        q006_2 = c("1098160170","89009521","9726314","28076230",
                                   "63451251","1090421499","37124019","28076230"),
                        q006_3 = c("52118967","41915062","1088245358","79277706",
                                   "91478662","80048634","28076230","28076230")),
                    class = c("data.frame"), row.names = c(NA, -8L))