r bioinformatics vcf-variant-call-format

Convert a DNAStringSetList into plain strings

This concerns the R package "VariantAnnotation" and its dependency "Biostrings".

I have a DNAStringSetList and I want to convert it into a plain list or a character vector.

library(VariantAnnotation)

fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")

vcf <- readVcf(fl, "hg19")

tempo <- rowRanges(vcf)$ALT  # This is the DNAStringSetList I mean.

print(tempo)

A DNAStringSet instance of length 10376
    width seq
[1]     1 G
[2]     1 T
[3]     1 A
[4]     1 T
[5]     1 T
...   ... ...
[10372]     1 G
[10373]     1 G
[10374]     1 G
[10375]     1 A
[10376]     1 C

tempo[[1]]
A DNAStringSet instance of length 1
    width seq
[1]     1 G

But I don't want this format. I just want the bases as plain strings so that I can insert them as a column in a new data frame. I want this:

G
T
A
T
T

I managed to get this by accessing the object's unlistData slot directly:

as.character(tempo@unlistData)

However, the result has 10 more elements than tempo does! The head and tail of the result match those of tempo exactly, so the 10 extra elements (which are not NAs) must sit somewhere in the middle.

Solution

You can call as.character on either a DNAString or a DNAStringSet.

as.character(tempo[1:5])
# [1] "G" "T" "A" "T" "T"
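
As for the 10 extra elements, a likely cause is that a VCF record can list more than one ALT allele. In that case some elements of the DNAStringSetList hold several sequences, and the flattened unlistData has one entry per allele rather than one per record. Here is a minimal sketch on a toy object. The toy data and the variable names are made up for illustration, and I am assuming elementNROWS, CharacterList and unstrsplit are available once Biostrings is attached (they come from S4Vectors and IRanges).

```r
library(Biostrings)  # also attaches S4Vectors and IRanges

# Toy stand-in for rowRanges(vcf)$ALT: the second record has two ALT alleles.
alt <- DNAStringSetList("G", c("T", "A"), "C")

# Number of alleles per record; values above 1 mark multi-allelic records.
n_alt <- elementNROWS(alt)
n_alt
# [1] 1 2 1

# Flattening yields one string per allele, so it is longer than the list.
flat <- as.character(unlist(alt))
length(alt)   # 3 records
length(flat)  # 4 alleles

# One string per record: join the alleles of each record with a comma.
per_record <- unstrsplit(CharacterList(alt), sep = ",")
per_record
# [1] "G"   "T,A" "C"
```

On the real object, which(elementNROWS(tempo) > 1) should point at the records responsible for the extra elements. The per-record vector has the same length as tempo, so it can go straight into a data frame column.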