I have a FASTQ quality string, which is presented as a series of ASCII characters. In this case ASCII characters 64 to 126 (likely) represent scores of 0 to 62 (presuming it is Illumina). The quality string looks like this:
feffefdfbefdfffcfdeTddaYddffbfcI``S_KKX_]]MR[D_TY[VTVXQ]`Q_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
How do I extract the numeric ASCII code of each character?
Thank you San
EDIT: This string encodes the quality of a biological sequence made up of bases (from base pairs in nucleic acids), each written as one character (A, T, G or C). A base quality is the Phred-scaled base error probability, which equals -10 log10 Pr{base is wrong}.
As Marek said, you might find a function to convert Illumina quality scores in Bioconductor. You can also ask at biostar.stackexchange.com.
Using base functions, you can use charToRaw():
> x <- "feeffdbefc`\\KKX]_BBBB"
> charToRaw(x)
[1] 66 65 65 66 66 64 62 65 66 63 60 5c 4b 4b 58 5d 5f 42 42 42 42
> as.numeric(charToRaw(x))
[1] 102 101 101 102 102 100 98 101 102 99 96 92 75 75 88 93 95 66 66 66 66
> as.character(charToRaw(x))
[1] "66" "65" "65" "66" "66" "64" "62" "65" "66" "63" "60" "5c" "4b" "4b" "58" "5d" "5f" "42" "42" "42" "42"
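Mind you, if you type the string as a literal in R code you'll have to escape the backslash (as in x above), or you'll get into trouble. Whether that matters depends on how you read in your data: text read from a file with, for example, readLines() needs no escaping.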
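To go from the ASCII codes to actual quality scores, you subtract the encoding offset. Here is a minimal sketch, assuming the Phred+64 encoding guessed at in the question (if your data turn out to be Phred+33, change the offset). The variable names are just for illustration, and utf8ToInt() is used as a base-R shortcut that returns the same integers as as.integer(charToRaw(x)) for plain ASCII input:

```r
# Assumption: Phred+64 encoding (ASCII 64 represents a score of 0).
# Use offset <- 33L instead if the file is Phred+33 encoded.
qual   <- "feeffdbefc`\\KKX]_BBBB"  # backslash escaped only because this is a literal
offset <- 64L

codes  <- utf8ToInt(qual)    # integer ASCII code of each character
scores <- codes - offset     # quality score per base
p_err  <- 10^(-scores / 10)  # error probability, from Q = -10 * log10(p)

head(scores)
# [1] 38 37 37 38 38 36
```

The trailing run of B characters comes out as a score of 2 under this offset, so it is worth checking which encoding your pipeline really used before trusting the numbers.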