Parsing a string and splitting it in R


I have a regex problem with handling strings in R.

I have a data structure provided by the RNAfold software that looks like this:

"....(((..((((((((.(((((((((((.........))))))))))).))))))))..))).."

This is a typical secondary structure for miRNAs, but I also have other sequences that are not miRNAs, which look somewhat like this:

...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......

This second sequence has two hairpin loops, one at the beginning and another one in the middle, whereas the first sequence just has one hairpin loop in the middle.

Dots (".") represent nucleotides that are not paired, while "(" represents nucleotides that are paired with their counterparts, represented as ")".

I want to split this string so that I can get the stems in the structure.

The output I would like to obtain is:

Input:

[1] "....(((..((((((((.(((((((((((.........))))))))))).))))))))..))).."

Output:

[1] "....(((..((((((((.(((((((((((........."
[2] "))))))))))).))))))))..))).."

That way I can count the number of split strings and, from that, the number of stems.

The result for the second sequence would be:

Input:

[1] "...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...)))))))......."

Output:

[1] "...((((....."
[2] "))))...........(((((((...((..(((..((((...((((((....."
[3] ")))).))...)))).))).))...)))))))......."

So in essence, what I want is to parse the strings so that they are split just before the ")" symbol that closes each hairpin loop, keeping all the symbols of the string.

I have tried using strsplit() and some regex variations, but I haven't been able to find the trick...

Any help?

Thanks


Solution

  • You could split on an opening parenthesis and use a lookahead to require that it is immediately followed by one or more dots and then a closing parenthesis.

    x <- c("....(((..((((((((.(((((((((((..))))))))))).))))))))..)))..", 
           "...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......")
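    # Split on each "(" that is directly followed by dots and then ")".
    # perl = TRUE is needed because the lookahead (?=...) is PCRE syntax.
    strsplit(x, "\\((?=(\\.+\\)))", perl = TRUE)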
    # [[1]]
    # [1] "....(((..((((((((.(((((((((("  "..))))))))))).))))))))..))).."
    # 
    # [[2]]
    # [1] "...((("  ".....))))...........(((((((...((..(((..((((...((((("
    # [3] ".....)))).))...)))).))).))...)))))))......."
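
The split above consumes the matched "(" and cuts before the dots rather than after them, so it does not reproduce the requested output exactly. Here is a minimal sketch of one possible alternative, assuming the character "|" never occurs in the RNAfold structure strings: mark each boundary with gsub() first, then split on the literal marker. The names marked, pieces and n_hairpins are illustrative only.

```r
# Dot-bracket strings from the question.
x <- c("....(((..((((((((.(((((((((((.........))))))))))).))))))))..)))..",
       "...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......")

# Insert a marker after every "(" plus run of dots that is directly followed by ")".
# Assumption: "|" never appears in the input strings.
marked <- gsub("(\\(\\.+)(?=\\))", "\\1|", x, perl = TRUE)

# Split on the literal marker; no character of the structure is lost.
pieces <- strsplit(marked, "|", fixed = TRUE)

# Each hairpin loop contributes one split point, so loops = pieces - 1.
n_hairpins <- lengths(pieces) - 1
```

For the two example strings this should give two and three pieces respectively, that is, one and two hairpin loops. Because the marker is inserted rather than matched against existing characters, pasting the pieces back together returns the original string.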