
Read a FASTA alignment file into R with each nucleotide of each sequence in its own column


Is there a way to read a FASTA alignment file into R so that each nucleotide of each sequence ends up in its own column? For example, given:

>sample1
atgc
>sample2
aagc

I would like to get five columns in R: the first holding the sample name, and each of the others holding one nucleotide position of the sequence.

sample1 a t g c
sample2 a a g c

Is this possible?


Solution

  • First, read in your file. readLines() reads a file into a character vector with one element per line. Assuming the file contains only data of the type shown in your question (each header line followed by a single sequence line), you can use:

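    # Use forward slashes (or doubled backslashes); a single "\" starts an escape sequence in R strings
    file_lines <- readLines("/your/file/path.file")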
    

    Then use functions from dplyr, stringr, and tidyr to reshape the lines into one row per sample and one column per nucleotide:

    library(dplyr)
    library(stringr)
    library(tidyr)
    
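    # Pair each header line with the sequence line that follows it
    matrix(file_lines, ncol = 2, byrow = TRUE) %>%
      as.data.frame() %>%
      rename(sample = V1) %>%
      # Drop the leading ">" from the sample names
      mutate(sample = str_remove(sample, ">")) %>%
      # Split after characters 1, 2 and 3 to get four one-letter columns
      separate(V2, into = paste0(".", 1:4), sep = 1:3)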
    
       sample .1 .2 .3 .4
    1 sample1  a  t  g  c
    2 sample2  a  a  g  c
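
The pipeline above hard-codes four positions and expects each sequence on a single line. If your alignment is longer, or the sequences are wrapped over several lines, a base R variant along the following lines may be easier to adapt. This is only a sketch under a few assumptions: every header is followed by at least one sequence line, all sequences have the same length (as they should in an alignment), and the column names `pos1`, `pos2`, ... are arbitrary labels chosen here for illustration. The input is written inline so the example runs on its own; swap in `readLines()` for a real file.

```r
# Example input held in memory so the sketch runs without a file;
# for real data, replace this with readLines() on your alignment file.
file_lines <- c(">sample1", "atgc", ">sample2", "aagc")

# Header lines start with ">"; every other line is sequence data.
is_header <- grepl("^>", file_lines)
sample    <- sub("^>", "", file_lines[is_header])

# Assign each sequence line to the header it follows,
# then join wrapped lines into one string per sample.
group <- cumsum(is_header)[!is_header]
seqs  <- as.vector(tapply(file_lines[!is_header], group, paste, collapse = ""))

# Split each sequence into single characters: one row per sample,
# one column per alignment position (assumes equal-length sequences).
nucleotides <- do.call(rbind, strsplit(seqs, ""))
colnames(nucleotides) <- paste0("pos", seq_len(ncol(nucleotides)))

result <- data.frame(sample = sample, nucleotides, stringsAsFactors = FALSE)
print(result)
```

Because the number of columns is taken from the sequences themselves, the same code should work for alignments of any length without editing the column specification.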