r bioinformatics fasta

How to iterate entries in a function to create two new character vectors


I am struggling to split a single string input into separate entries. The user supplies a string of FASTA-formatted sequences (see the example below). I am able to separate the entries into their own elements of a character vector:

Example:

">Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
"
[1] "Rosalind_6404CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG"    
[2] "Rosalind_5959CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC"

But I am struggling to write a function that splits the tag (e.g. "Rosalind_6404") from the gene sequence for an unknown number of FASTA entries and collects the pieces into two new vectors. Ultimately, the result would look something like:

> "Rosalind_6404" "Rosalind_5959"
> "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG" "CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC"

I was hoping the convert_entries function would allow me to iterate over all the elements of the prepped_s character vector and split the elements into two new vectors with the same index number.

s <- ">Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC"

split_s <- strsplit(s, ">")
ul_split_s<- unlist(split_s)
fixed_s <- gsub("\n","", ul_split_s)
prepped_s <- fixed_s[-1]
prepped_s
nchar(prepped_s[2])
print(prepped_s[2])

entry_tags <- list()
entry_seqs <- list()

entries <- length(prepped_s)
unlist(entries)
first <- prepped_s[1]

convert_entries <- function() {
  for (i in entries) {
    tag <- substr(prepped_s[i], start = 1, stop = 13)
    entry_tags <- append(entry_tags, tag)
    return(entry_tags)
  } 
}
entry_tags <- convert_entries()
print(entry_tags)

Please help in any way you can, thanks!


Solution

  • One option with tidyverse

    library(dplyr)
    library(tidyr)
    library(stringr)
    tibble(col1 = s) %>% 
       separate_rows(col1, sep="\n") %>%
       group_by(grp = cumsum(str_detect(col1, '^>'))) %>%
       summarise(prefix = first(col1), 
                 col1 = str_c(col1[-1], collapse=""), .groups = 'drop') %>% 
       select(-grp)
    

    -output

    # A tibble: 2 x 2
      prefix           col1                                                                                
      <chr>          <chr>                                                                               
    1 >Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG    
    2 >Rosalind_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC
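
If two plain character vectors are wanted rather than a tibble, a base R approach may also work. In the question's loop, `for (i in entries)` iterates over the single number `length(prepped_s)` rather than over each index (`seq_along(prepped_s)` would do that), and `return()` inside the loop exits on the first pass, so only one tag ever comes back. The sketch below avoids the loop altogether. `parse_fasta` is a hypothetical helper name, not something from the original post, and it assumes every entry starts with `>`, that the header contains no further `>`, and that lines end in `\n` only.

```r
# Example input from the question: two FASTA entries in one string
s <- ">Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC"

# Hypothetical helper: returns the tags and sequences as parallel vectors
parse_fasta <- function(x) {
  # Split into entries at ">" and drop the empty piece before the first header
  records <- strsplit(x, ">", fixed = TRUE)[[1]]
  records <- records[nzchar(records)]

  # Split each entry into lines: the first line is the tag,
  # the remaining lines are the wrapped sequence
  lines <- strsplit(records, "\n", fixed = TRUE)

  list(
    tags = vapply(lines, function(l) l[1], character(1)),
    seqs = vapply(lines, function(l) paste(l[-1], collapse = ""), character(1))
  )
}

parsed <- parse_fasta(s)
entry_tags <- parsed$tags
entry_seqs <- parsed$seqs
```

Here `entry_tags[i]` and `entry_seqs[i]` refer to the same entry, and the tags carry no leading `>`. Because each entry is split on its first newline, this does not depend on the tag being exactly 13 characters long, unlike the `substr(..., 1, 13)` call in the question.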