I would like to know if there is a way to read a FASTA alignment file into R so that each nucleotide of each sequence ends up in its own column. For example:
>sample1
atgc
>sample2
aagc
I would like to get 5 columns in R: the first column holding the sample name, and each of the remaining columns holding one nucleotide of that sample's sequence.
sample1 a t g c
sample2 a a g c
Is this possible?
First you need to read in your file. readLines() reads a file into a character vector with one element for each line. Assuming the file only contains data of the type shown in your question, you can use:
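# Use forward slashes (or doubled backslashes) in the path: a single "\" starts an escape sequence in R strings
file_lines <- readLines("/your/file/path.file")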
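To see what readLines() returns without needing a file on disk, here is a small sketch that reads the two records from the question through textConnection(). The inline character vector is only a stand-in for your real file.

```r
# Stand-in for the FASTA file: the two records from the question,
# one element per line
fasta_text <- c(">sample1", "atgc", ">sample2", "aagc")

# textConnection() lets readLines() read from text instead of a path
con <- textConnection(fasta_text)
file_lines <- readLines(con)
close(con)

# Header lines and sequence lines alternate, one element per line
print(file_lines)
```

The rest of the answer relies on exactly this alternating layout, so it is worth checking before reshaping.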
Then, functions from dplyr, stringr, and tidyr can help you clean up your data.
library(dplyr)
library(stringr)
library(tidyr)
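# Pair each header line with the sequence line that follows it
matrix(file_lines, ncol = 2, byrow = TRUE) %>%
  as.data.frame() %>%
  rename(sample = V1) %>%
  mutate(sample = str_remove(sample, ">")) %>%   # drop the leading ">"
  # split after characters 1, 2, and 3 to get four one-letter columns
  separate(V2, into = paste0(".", 1:4), sep = 1:3)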
   sample .1 .2 .3 .4
1 sample1  a  t  g  c
2 sample2  a  a  g  c
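The code above hardcodes four nucleotide columns. If your alignment is longer, a base R variant along these lines might be easier to adapt. It is only a sketch: it assumes header and sequence lines strictly alternate, that each sequence sits on a single line, and that all sequences have the same length (as in an alignment). The pos1, pos2, ... column names are made up for illustration.

```r
# Assumed input: header and sequence lines alternate, one sequence per line
file_lines <- c(">sample1", "atgc", ">sample2", "aagc")

headers   <- file_lines[c(TRUE, FALSE)]   # odd-numbered lines
sequences <- file_lines[c(FALSE, TRUE)]   # even-numbered lines

# Split every sequence into single characters and stack them row-wise;
# this only forms a clean matrix when all sequences have equal length
nucleotides <- do.call(rbind, strsplit(sequences, ""))
colnames(nucleotides) <- paste0("pos", seq_len(ncol(nucleotides)))

result <- data.frame(
  sample = sub("^>", "", headers),   # strip the leading ">"
  nucleotides,
  stringsAsFactors = FALSE
)
print(result)
```

Because the number of columns comes from the data, the same code should work for alignments of any width. FASTA files that wrap sequences over several lines would need the lines of each record pasted together first.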