
Using substr() within apply() in R


I have a data frame with 25 million rows, and I need to run a substring function on every row. Because of the size of the data frame, I thought apply() would be the most efficient way of doing this.

df <- data.frame( seq_start=c(75, 59, 44), 
                  seq_end=c(151, 135, 120), 
                  sequence=c("NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTATATGGACCATGATCTGATGGGACTACTGGAATCAGGCTTGGTTCATTTTA", "NTATTACTAAGAGATTTGGTTTTAACTATGAATCCATGATGAAATTATGAACTCTTAATAAATTTAAAAAGACAAGCAACCCAATCAAAAAATGGGCAAAGGATATGAATGGGGAATTCACAGACAAGAAAACACAAATAGATCGGAAGAG", "NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTTATCTGGTGTTTGAATATATGGACCATGATCTGATGGGACTACTGGAATCA")) 

This is the call I thought would be the most efficient way to accomplish this:

apply(df,1,substr(sequence,seq_start,seq_end))

I'm not familiar with the apply() function, and a loop is way too inefficient to process 25 million rows.


Solution

  • I'm not 100% sure what you need, but dplyr syntax seems useful here (more useful than apply(), as you're only extracting a substring from a single column):

    library(dplyr)
    df %>%
      mutate(substring = substr(sequence, seq_start, seq_end))

    Output:

      seq_start seq_end
    1        75     151
    2        59     135
    3        44     120
                                                                                                                                                     sequence
    1 NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTATATGGACCATGATCTGATGGGACTACTGGAATCAGGCTTGGTTCATTTTA
    2 NTATTACTAAGAGATTTGGTTTTAACTATGAATCCATGATGAAATTATGAACTCTTAATAAATTTAAAAAGACAAGCAACCCAATCAAAAAATGGGCAAAGGATATGAATGGGGAATTCACAGACAAGAAAACACAAATAGATCGGAAGAG
    3 NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTTATCTGGTGTTTGAATATATGGACCATGATCTGATGGGACTACTGGAATCA
                                                                          substring
    1 ATTATTCTCATTCTTAGGTGCATTTTATATGGACCATGATCTGATGGGACTACTGGAATCAGGCTTGGTTCATTTTA
    2 TAAATTTAAAAAGACAAGCAACCCAATCAAAAAATGGGCAAAGGATATGAATGGGGAATTCACAGACAAGAAAACAC
    3 AAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTTATCTGGTGTTTGAATATAT
    

    Base R (substr() is vectorized over the string, start, and stop arguments, so no apply() is needed):

    df$substring <- substr(df$sequence,df$seq_start,df$seq_end)
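
To illustrate why the vectorized call is enough, here is a minimal sketch on made-up toy data (the `toy` data frame and its short sequences are assumptions for illustration, not the question's data). It also shows roughly what a working apply() version would look like: the third argument to apply() has to be a function, whereas `substr(sequence, seq_start, seq_end)` in the question is a call that R tries to evaluate immediately, outside the data frame. Note that apply() converts the data frame to a character matrix first, so the positions have to be converted back to integers.

```r
# Toy data with short, made-up sequences (not the data from the question).
toy <- data.frame(
  seq_start = c(2, 1, 4),
  seq_end   = c(4, 3, 6),
  sequence  = c("ABCDEF", "GHIJKL", "MNOPQR"),
  stringsAsFactors = FALSE
)

# Vectorized: a single call handles every row at once.
vectorized <- substr(toy$sequence, toy$seq_start, toy$seq_end)

# Row-wise with apply(): FUN must be a function. Each row arrives as a
# character vector, so start/stop are converted back with as.integer().
row_wise <- apply(toy, 1, function(row) {
  substr(row[["sequence"]],
         as.integer(row[["seq_start"]]),
         as.integer(row[["seq_end"]]))
})

print(vectorized)  # "BCD" "GHI" "PQR"

# Both approaches give the same substrings.
stopifnot(identical(vectorized, unname(row_wise)))
```

The row-wise version calls an R function once per row, so on 25 million rows the single vectorized call is likely to be much faster; timing both on a sample of the real data would confirm this.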