
Extract specific number of characters from every row and insert letters conditionally in R


I'm working with DNA sequencing data, and I need to extract a specific number of nucleotides from each row of a matrix.

The dataset looks like this:

1 "GCGGGCGGGGCGGGGTCTTGTGTGGGCTCAGC"
2 "GCAGTAA"
3 "GAACAGTGGCCGGAGCGTCT"
.... (Many many rows)

From each row, I would like to (1) extract the last 10 nucleotides (the 'tail'), and (2) pad the beginning with the dummy letter 'Z' to bring the total to 10 characters, but only when the sequence is shorter than 10 nt.

The final results should look like this.

1 "TGGGCTCAGC"
2 "ZZZGCAGTAA"
3 "CGGAGCGTCT"
.... (Many many rows)

First I tried the tail() function to extract the last nucleotides:

tail(mydata, n=10)

but this returns the last 10 rows of the mydata matrix, not the last 10 nucleotides of each row. Is there a way to achieve this in R?

Thank you very much for your help


Solution

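  • tail() is not the right function for this job because it operates on the elements of an object (for a matrix, its rows). What you want are functions that operate on the characters inside each element.

    I presume you have many sequences to process, so I recommend the very efficient stringi package. Here stri_sub(m, -10L) takes up to the last 10 characters of each string, and stri_pad() pads on the left by default, so shorter strings are filled with "Z" at the beginning. In the following code, matrix() is only necessary if you want a matrix result. Otherwise a character vector will be returned.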

    library(stringi)
    matrix(stri_pad(stri_sub(m, -10L), 10L, pad = "Z"))
    #      [,1]        
    # [1,] "TGGGCTCAGC"
    # [2,] "ZZZGCAGTAA"
    # [3,] "CGGAGCGTCT"
    

    where m is the original data

    m <- matrix(
        c("GCGGGCGGGGCGGGGTCTTGTGTGGGCTCAGC", "GCAGTAA", "GAACAGTGGCCGGAGCGTCT")
    )
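
If installing stringi is not an option, the same idea can be sketched in base R. This is only an illustration under the assumption that every entry is a plain character string with no NA values; the names seqs, n, tails and padded are made up for this example and are not part of the original answer.

```r
# Same example data as above
m <- matrix(
    c("GCGGGCGGGGCGGGGTCTTGTGTGGGCTCAGC", "GCAGTAA", "GAACAGTGGCCGGAGCGTCT")
)

seqs <- as.vector(m)   # work on a plain character vector
n <- nchar(seqs)       # length of each sequence

# Keep at most the last 10 characters; pmax() stops the start
# position from dropping below 1 for short sequences
tails <- substr(seqs, pmax(n - 9L, 1L), n)

# Left-pad with "Z" up to a total width of 10 (strrep() needs R >= 3.3.0)
padded <- paste0(strrep("Z", 10L - nchar(tails)), tails)

# Wrap in matrix() only if a matrix result is wanted
result <- matrix(padded)
```

With the three example rows, padded should hold the same strings as the stringi output shown above. For very large inputs stringi is likely still the faster choice.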