r regex pattern-matching genetic-algorithm genetic-programming

Regular expression to catch all possibilities?


My input is genetic data that looks like this:

SNP       VALUE
rs123456  A/G
rs345353  del/CTT
rs343524  T
rs243224  T/del
....

Without going deep into the genetics: everyone has two alleles (one from each parent). A single value without a "/" (A, C, G, T, del, or CTT) means both alleles are the same; otherwise a "/" separates the two differing alleles.

Long story short, I need to find known patterns of SNPs, but I understand that there are many possibilities when the number of slashed ("/") values is large.

I have already built a regular expression like this: [A|C|G|T|del|CTT].

A/G is equivalent to G/A, so I need to match all possibilities.

Is there any function or logic that can help me to do this? Please advise.

P.S.

Adding more info:

The expected output is all possible combinations of the values, for example:

rs123 = A/G, rs456 = T/C, rs789 = CTT: 
Option 1: A T CTT; 
Option 2: A C CTT; 
Option 3: G T CTT; 
Option 4: G C CTT; 

But if I have more than 2 slashed ("/") values, I want to get all the options.


Solution

  • If I understand correctly, you are after this:

    df <- data.frame(SNP = c("rs123456", "rs345353", "rs343524", "rs243224"),
                     value = c("A/G", "del/CTT", "T", "T/del"), stringsAsFactors = FALSE)
    
    expand.grid(strsplit(df$value, "/"))
    #output
      Var1 Var2 Var3 Var4
    1    A  del    T    T
    2    G  del    T    T
    3    A  CTT    T    T
    4    G  CTT    T    T
    5    A  del    T  del
    6    G  del    T  del
    7    A  CTT    T  del
    8    G  CTT    T  del
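
    The Var1 to Var4 columns follow the row order of df, so it can be hard to tell which SNP each one belongs to. A minimal sketch, assuming the same df as above, that names each split element after its SNP and counts the combinations up front (the names alleles, n_combos and combos are only illustrative):

    ```r
    # Same example data as above
    df <- data.frame(SNP = c("rs123456", "rs345353", "rs343524", "rs243224"),
                     value = c("A/G", "del/CTT", "T", "T/del"),
                     stringsAsFactors = FALSE)

    # Split each genotype on a literal "/" and name each element after its SNP,
    # so that expand.grid() uses the SNP ids as column names
    alleles <- setNames(strsplit(df$value, "/", fixed = TRUE), df$SNP)

    # The number of combinations is the product of the allele counts per SNP
    n_combos <- prod(lengths(alleles))

    # One row per combination, one column per SNP, kept as character columns
    combos <- expand.grid(alleles, stringsAsFactors = FALSE)
    ```

    Because the row count is a product, it doubles with every additional heterozygous SNP, so checking n_combos before calling expand.grid is probably worthwhile on larger inputs.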
    

    Or, if one string per combination is required:

    apply(expand.grid(strsplit(df$value, "/")), 1, paste, collapse = " ")
    #output
    [1] "A del T T"   "G del T T"   "A CTT T T"   "G CTT T T"   "A del T del" "G del T del"
    [7] "A CTT T del" "G CTT T del"
    

    Or, using do.call instead of apply:

    do.call(paste, c(expand.grid(strsplit(df$value, "/")), sep=" "))
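
    On the regular-expression part of the question: square brackets define a character class, so [A|C|G|T|del|CTT] matches any single one of the listed characters (including "|") rather than the whole tokens; alternation needs parentheses. The sketch below is one possible way to validate the values and to treat A/G and G/A as the same genotype. It assumes the allele set from the question (A, C, G, T, del, CTT), and normalise_genotype is a hypothetical helper, not an existing function:

    ```r
    # Assumption: the only valid alleles are the ones listed in the question
    allele  <- "(A|C|G|T|del|CTT)"
    # One allele, optionally followed by further "/"-separated alleles
    pattern <- paste0("^", allele, "(/", allele, ")*$")

    # TRUE for well-formed genotype strings, FALSE otherwise
    is_valid <- grepl(pattern, c("A/G", "del/CTT", "T", "A|G", "AG"))

    # Hypothetical helper: sort the alleles within each genotype so that
    # "G/A" and "A/G" compare equal. method = "radix" sorts in C-locale
    # order, which keeps the result independent of the session locale.
    normalise_genotype <- function(x) {
      vapply(strsplit(x, "/", fixed = TRUE),
             function(a) paste(sort(a, method = "radix"), collapse = "/"),
             character(1))
    }

    same <- normalise_genotype("G/A") == normalise_genotype("A/G")
    ```

    Normalising before comparing avoids having to list both orders of every pair in the pattern itself.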