
Function to sample a variable number of substrings given string length


I'm trying to write an R function that samples a variable number of 5-character substrings from each row of a data frame, where the number of samples depends on the length of the original string in that row. I first calculated how many draws each row should get and stored it in the "num_draws" column; I would now like the function to use that column so that each row is sampled its own number of times. My thought was to write the function for a single generic row and then use an apply statement outside the function to act on every row, but I can't figure out how to make the function refer to column 3 for the current row only (rather than the value from just the first row, or the values from all rows).

example data frame:

  BP                             TF                                  num_draws
1 CGGCGCATGTTCGGTAATGA           TFTTTFTTTFFTTFTTTTTF                6
2 ATAAGATGCCCAGAGCCTTTTCATGTACTA TFTFTFTFFFFFFTTFTTTTFTTTTFFTTT      9
3 TCTTAGGAAGGATTC                FTTTTTTTTTFFFFF                     4

desired output:

[1]GGCGC FTTTF 
   AATGA TTTTF 
   TGTTC TTFFT
   TAATG TTTTT
   AATGA TTTTF   
   CGGCG TFTTT

[2]AGATG FTFTF
   ATAAG TFTFT
   ATGCC FTFFF
   GCCCA FFFFF
   ATAAG TFTFT
   GTACT TFFTT
   GCCCA FFFFF
   TGCCC TFFFF
   AGATG FTFTF

[3]TTAGG TTTTT
   CTTAG TTTTT
   GGAAG TTTTT
   GGATT TTFFF

example code:

#make example data frame
BaseP1 <- paste(sample(size = 20, x = c("A","C","T","G"), replace = TRUE), collapse = "")
BaseP2 <- paste(sample(size = 30, x = c("A","C","T","G"), replace = TRUE), collapse = "")
BaseP3 <- paste(sample(size = 15, x = c("A","C","T","G"), replace = TRUE), collapse = "")
TrueFalse1 <- paste(sample(size = 20, x = c("T","F"), replace = TRUE), collapse = "")
TrueFalse2 <- paste(sample(size = 30, x = c("T","F"), replace = TRUE), collapse = "")
TrueFalse3 <- paste(sample(size = 15, x = c("T","F"), replace = TRUE), collapse = "")
my_df <- data.frame(c(BaseP1,BaseP2,BaseP3), c(TrueFalse1, TrueFalse2, TrueFalse3)) 


#calculate number of draws by length 
frag_length<- 5 
my_df<- cbind(my_df, (round((nchar(my_df[,1]) / frag_length) * 1.5, digits = 0)))
colnames(my_df) <- c("BP", "TF", "num_draws")

#function to sample x number of draws per row (this does not work)
Fragment = function(string) {
  nStart = sample(1:(nchar(string) -5), 1)
  samp<- substr(string, nStart, nStart + 4)
replicate(n= string[,3], expr = samp)
  }


apply(my_df[,1:2], c(1,2), Fragment)

Solution

  • One option is to give the function a second argument, n, and to sample nStart inside the replicate call, so that every draw gets its own start position:

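    Fragment <- function(string, n) {
      # Draw a fresh start position on every replicate so the fragments differ
      replicate(n, {
        # nchar(string) - 4 is the last start that still yields 5 characters
        nStart <- sample(1:(nchar(string) - 4), 1)
        substr(string, nStart, nStart + 4)
      })
    }
    
    # apply() passes each row as a character vector, so convert num_draws back
    apply(my_df, 1, function(x) data.frame(lapply(x[1:2], Fragment, n = as.integer(x[3]))))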
    #$`1`
    #     BP    TF
    #1 GGCGC FFTTF
    #2 GGTAA TFFTT
    #3 GCGCA TTFTT
    #4 CGCAT TFFTT
    #5 GGCGC FTTTF
    #6 TGTTC FTTFT
    
    #$`2`
    #     BP    TF
    #1 GTACT TTTTF
    #2 ATAAG FTTFT
    #3 GTACT TFTFF
    #4 TAAGA TTTTF
    #5 CCTTT FFTTF
    #6 TCATG TTTTF
    #7 CCAGA TFTFT
    #8 TTCAT TFTFT
    #9 CCCAG FTFTF
    
    #$`3`
    #     BP    TF
    #1 AAGGA TTTFF
    #2 AGGAT TTTTT
    #3 CTTAG TFFFF
    #4 TAGGA TTTFF
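
The desired output in the question appears to take the BP and TF fragments from the same positions (for example, GGCGC and FTTTF are both characters 2 to 6 of row 1), whereas the Fragment call above samples a start position separately for each column. If aligned fragments are what is wanted, a minimal sketch could look like the following. It assumes both strings in a row have equal length; `fragment_pair` and its `frag_length` argument are hypothetical names introduced here, not part of the original answer.

```r
# Hypothetical helper (not from the original answer): draw n fragments,
# reusing each sampled start position for both strings.
fragment_pair <- function(bp, tf, n, frag_length = 5) {
  stopifnot(nchar(bp) == nchar(tf), nchar(bp) >= frag_length)
  # Valid start positions run from 1 to nchar(bp) - frag_length + 1.
  starts <- sample.int(nchar(bp) - frag_length + 1, size = n, replace = TRUE)
  data.frame(
    BP = substring(bp, starts, starts + frag_length - 1),
    TF = substring(tf, starts, starts + frag_length - 1),
    stringsAsFactors = FALSE
  )
}

# Example data copied from the question.
my_df <- data.frame(
  BP = c("CGGCGCATGTTCGGTAATGA",
         "ATAAGATGCCCAGAGCCTTTTCATGTACTA",
         "TCTTAGGAAGGATTC"),
  TF = c("TFTTTFTTTFFTTFTTTTTF",
         "TFTFTFTFFFFFFTTFTTTTFTTTTFFTTT",
         "FTTTTTTTTTFFFFF"),
  num_draws = c(6, 9, 4),
  stringsAsFactors = FALSE
)

# Map() walks the three columns in parallel, so each row keeps its own
# num_draws and the numeric column is never coerced to character.
fragments <- Map(fragment_pair, my_df$BP, my_df$TF, my_df$num_draws)
names(fragments) <- seq_len(nrow(my_df))
```

Here `fragments[["1"]]` is a data frame with six rows, `fragments[["2"]]` has nine and `fragments[["3"]]` has four; the fragments themselves change from run to run unless a seed is set with `set.seed()`.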