
Iterate over a dataframe then apply function on a dataset depending on same name in R


I think the answer involves `split` and a for loop, but as I am fairly new to R I can't quite make it work. I have a data frame:

    row.names        start  end   length  transcript
1   NM_008866.1      22     714   693     NM_008866
2   NM_008866.2      125    196   72      NM_008866
3   NM_008866.3      129    242   114     NM_008866
...
14  NM_001159750.37  221    1123  903     NM_001159750
15  NM_001159750.40  453    557   105     NM_001159750
16  NM_001159750.41  570    644   75      NM_001159750
...

and a DNAStringSet:

A DNAStringSet instance of length 2
    width seq                                                         names               
[1]  2433 GCACTGTCCGCCAGCCGGTGGATGTGCG...TGTGAAATAAAATTTAATTTTGGCTTTA NM_008866
[2]  2668 ACTTCTACTTTCCAGTCTCCTGCGATCG...TCAATAAAGTTTTTTGTTGTTAAACATA NM_001159750

For every transcript name I want to apply a function (subseq()) to the matching sequence in the DNAStringSet (matched by name). subseq() should take its start and end arguments from the start and end columns of my data frame, one row at a time.

This is what I have for the moment (I think I should split both the data frame and the DNAStringSet, right?):

results <- list()
for (myName in names(dataframe)){
  localdf<- dataframe[[myName]]
  localseqsplit <- dataset[[myName]]
  results<-subseq(localseqsplit,start=localdf$start,end=localdf$end)
  temp<-results[[myName]]
  return(temp)
 }

Solution

  • Since you don't have a reproducible example or a representative output, here is my initial guess at what you are looking for.

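    # load Biostrings, which provides DNAStringSet() and subseq()
    library(Biostrings)

    # make a very basic working example
    # (length = end - start + 1, because subseq() ranges include both ends)
    df <- read.table(header=TRUE, text='
      row.names        start  end  length  transcript
      NM_008866.1      10     18   9       NM_008866
      NM_008866.2      15     22   8       NM_008866
      NM_008866.3      19     28   10      NM_008866
      NM_001159750.37  5      22   18      NM_001159750
      NM_001159750.40  8      30   23      NM_001159750
      NM_001159750.41  12     32   21      NM_001159750')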
    
    # create the DNAStringSet
    x0 <- c(NM_008866 = "GCACTGTCCGCCAGCCGGTGGATGTGCG", NM_001159750="ACTTCTACTTTCCAGTCTCCTGCGATCGAAGC")
    dna <- DNAStringSet(x0)
    
    # split your dataset by transcript name
    df_split <- split(df, f=df$transcript)
    
    results <- list()
    for(myName in names(dna)){
      # get the index of which transcript you are working with
      index <- which(names(dna) == myName)
    
      # make sure the transcript is in your dataset
      if(myName %in% names(df_split)){
        # loop through the possible start and end indices
        for(j in seq_len(nrow(df_split[[myName]]))){
          # take the given dna string and create substrings from given indices
          dna_sub <- subseq(dna[index], start=df_split[[myName]]$start[j], end=df_split[[myName]]$end[j])
          # append results to list element with transcript name
          results[[myName]] <- append(results[[myName]], dna_sub)
        }
      }
    }
    results
    
    > results
    $NM_008866
      A DNAStringSet instance of length 3
        width seq         names               
    [1]     9 GCCAGCCGG   NM_008866
    [2]     8 CCGGTGGA    NM_008866
    [3]    10 TGGATGTGCG  NM_008866
    
    $NM_001159750
      A DNAStringSet instance of length 3
        width seq                      names               
    [1]    18 CTACTTTCCAGTCTCCTG       NM_001159750
    [2]    23 CTTTCCAGTCTCCTGCGATCGAA  NM_001159750
    [3]    21 CCAGTCTCCTGCGATCGAAGC    NM_001159750
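
The nested loop above can probably be shortened, because subseq() accepts one start and one end per sequence. The following is only a sketch under a few assumptions: the Biostrings package is installed, the toy data mirrors the example above, and `id` is a made-up column name standing in for the row.names column. It is not the only way to do this.

```r
library(Biostrings)

# Toy data mirroring the example above; `id` is a hypothetical column name
df <- data.frame(
  id         = c("NM_008866.1", "NM_008866.2", "NM_001159750.37"),
  start      = c(10L, 15L, 5L),
  end        = c(18L, 22L, 22L),
  transcript = c("NM_008866", "NM_008866", "NM_001159750"),
  stringsAsFactors = FALSE
)

dna <- DNAStringSet(c(
  NM_008866    = "GCACTGTCCGCCAGCCGGTGGATGTGCG",
  NM_001159750 = "ACTTCTACTTTCCAGTCTCCTGCGATCGAAGC"
))

# Keep only the rows whose transcript actually has a sequence
df <- df[df$transcript %in% names(dna), ]

# Pick the parent sequence for every row by name (names may repeat)
parents <- dna[df$transcript]

# One start/end pair per sequence, so no explicit loop over rows is needed
pieces <- subseq(parents, start = df$start, end = df$end)
names(pieces) <- df$id

# Group the pieces by transcript to get a list shaped like `results`
row_groups  <- split(seq_len(nrow(df)), df$transcript)
results_vec <- lapply(row_groups, function(i) pieces[i])
```

Here `pieces` holds one subsequence per row of the data frame, in row order, and `results_vec` is a plain list with one DNAStringSet per transcript. Note that split() orders the groups by transcript name, so the list order can differ from the order of names(dna).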