Read a FASTA alignment file into R with each nucleotide of each sequence in its own column

Is there a way to read a FASTA alignment file into R so that each nucleotide of each sequence ends up in its own column? For example, given:

>sample1
atgc
>sample2
aagc

I would like to get 5 columns in R: the first column holds the sample name, and each of the remaining columns holds one nucleotide of that sample's sequence.

sample1 a t g c
sample2 a a g c

Is this possible?

Solution

First you need to read in your file. readLines() reads a file into a character vector with one element per line. Assuming the file only contains data of the type shown in your question (one header line followed by one sequence line per sample), you can use:

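# Use forward slashes (or doubled backslashes): a single "\" starts an escape sequence in R strings
file_lines <- readLines("/your/file/path.file")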
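
To see what that vector looks like, here is a small sketch that writes the example from the question to a temporary file and reads it back. The temporary file and the name fasta_path are only stand-ins for your real alignment file.

```r
# Hypothetical stand-in for your alignment file, written to a temporary location
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">sample1", "atgc", ">sample2", "aagc"), fasta_path)

# One element per line: header, sequence, header, sequence
file_lines <- readLines(fasta_path)
file_lines
```

Header lines sit at the odd positions and sequence lines at the even positions, which is what the reshaping step relies on.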

Then functions from dplyr, stringr, and tidyr can reshape those lines into one row per sample and one column per nucleotide.

library(dplyr)
library(stringr)
library(tidyr)

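# Pair each header line with its sequence line (assumes one line per sequence)
matrix(file_lines, ncol = 2, byrow = TRUE) %>%
  as.data.frame() %>%
  rename(sample = V1) %>%
  mutate(sample = str_remove(sample, ">")) %>%
  # Split after positions 1, 2, and 3 to get four one-character columns
  separate(V2, into = paste0(".", 1:4), sep = 1:3)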

   sample .1 .2 .3 .4
1 sample1  a  t  g  c
2 sample2  a  a  g  c
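
The pipeline above hard-codes a sequence length of 4 and expects each sequence on a single line. If your real alignment is longer, or wraps sequences over several lines, a base R variant along these lines might be easier to adapt. This is only a sketch: it assumes every header is followed by at least one sequence line and that all sequences have the same length (as they should in an alignment), and the column names pos1, pos2, ... are my own choice.

```r
# Example input; in practice this would come from readLines() on your file
file_lines <- c(">sample1", "atgc", ">sample2", "aagc")

# Header lines start with ">"; every other line is sequence data
is_header <- startsWith(file_lines, ">")
sample <- sub("^>", "", file_lines[is_header])

# Join the sequence lines that belong to the same record (handles wrapped sequences)
record_id <- cumsum(is_header)[!is_header]
sequences <- as.character(
  tapply(file_lines[!is_header], record_id, paste, collapse = "")
)

# Split each sequence into single characters: rows are samples, columns are positions
nucleotides <- do.call(rbind, strsplit(sequences, ""))
colnames(nucleotides) <- paste0("pos", seq_len(ncol(nucleotides)))

result <- data.frame(sample = sample, nucleotides)
result
```

The result has one row per sample and one column per alignment position, so the number of columns follows the sequence length instead of being fixed in the code.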