My input is genetic data that looks like this:
SNP VALUE
rs123456 A/G
rs345353 del/CTT
rs343524 T
rs243224 T/del
....
Without getting deeply into genetics: everyone has 2 alleles (one from mom, one from dad). A single value without "/" (one of A, C, G, T, del, CTT) means both alleles are the same; a slash "/" separates the two alleles when they are different.
Long story short, I need to find known patterns of the SNPs, but I understand that there are a lot of possibilities when the number of slashed values is large.
I have already built a regular expression like this: [A|C|G|T|del|CTT].
A/G is equivalent to G/A, so I need to match all possibilities.
Is there any function or logic that can help me to do this? Please advise.
P.S
Adding more info:
The expected output is all possible variants of the values for example:
rs123 = A/G, rs456 = T/C, rs789 = CTT:
Option 1: A T CTT;
Option 2: A C CTT;
Option 3: G T CTT;
Option 4: G C CTT;
But if I have more than 2 slashed values, I want to get all the options.
If I understand correctly, you are after this:
# Example data: one row per SNP, alleles separated by "/"
df <- data.frame(SNP = c("rs123456", "rs345353", "rs343524", "rs243224"),
                 value = c("A/G", "del/CTT", "T", "T/del"),
                 stringsAsFactors = FALSE)
expand.grid(strsplit(df$value, "/"))
#output
  Var1 Var2 Var3 Var4
1    A  del    T    T
2    G  del    T    T
3    A  CTT    T    T
4    G  CTT    T    T
5    A  del    T  del
6    G  del    T  del
7    A  CTT    T  del
8    G  CTT    T  del
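The columns above are named Var1 to Var4 because the list passed to expand.grid is unnamed. A minimal sketch, assuming you want the SNP ids as column names and want to check how many combinations there will be before building them (the names `alleles`, `n_combos` and `combos` are just illustrative):

```r
df <- data.frame(SNP = c("rs123456", "rs345353", "rs343524", "rs243224"),
                 value = c("A/G", "del/CTT", "T", "T/del"),
                 stringsAsFactors = FALSE)

# Split each genotype on "/" and name the list elements by SNP id
alleles <- setNames(strsplit(df$value, "/", fixed = TRUE), df$SNP)

# The number of combinations is the product of the allele counts per SNP,
# so it doubles with every heterozygous (slashed) SNP
n_combos <- prod(lengths(alleles))

# One column per SNP, one row per combination, kept as character columns
combos <- expand.grid(alleles, stringsAsFactors = FALSE)
```

Since the row count doubles with each slashed value, checking `n_combos` first can help you avoid building a table that does not fit in memory.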
or if a string is required per combination
apply(expand.grid(strsplit(df$value, "/")), 1, paste, collapse = " ")
#output
[1] "A del T T" "G del T T" "A CTT T T" "G CTT T T" "A del T del" "G del T del"
[7] "A CTT T del" "G CTT T del"
or:
do.call(paste, c(expand.grid(strsplit(df$value, "/")), sep=" "))
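Regarding A/G = G/A and the regular expression in the question: square brackets define a character class, so [A|C|G|T|del|CTT] matches a single character out of that set rather than one of the listed alleles. A rough sketch of one way to handle both points, assuming the allele set is exactly the one from the question (the helper `canonical_genotype` is hypothetical, not an existing function): validate with a grouped alternation, then sort the two alleles so that both orders compare equal.

```r
# Grouped alternation: one allele, optionally followed by "/" and a second one
allele  <- "(A|C|G|T|del|CTT)"
pattern <- paste0("^", allele, "(/", allele, ")?$")

# Hypothetical helper: put the alleles of each genotype in a fixed order,
# so "A/G" and "G/A" end up as the same string
canonical_genotype <- function(x) {
  vapply(strsplit(x, "/", fixed = TRUE),
         # method = "radix" sorts in C-locale order, independent of the session locale
         function(a) paste(sort(a, method = "radix"), collapse = "/"),
         character(1))
}

is_valid <- grepl(pattern, c("A/G", "del/CTT", "T", "X/Y"))
same     <- canonical_genotype("A/G") == canonical_genotype("G/A")
```

Comparing canonical forms avoids having to enumerate both orders of every slashed value when matching against known patterns.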