Splitting a string using lookahead assertion regex


Here is a string:

[1] "5 15  3 23 11 59 44.7 -.263226218521e-03  .488853402202e-11  .000000000000e+01"

I need to split it by certain spaces keeping first 7 numbers together, like this:

[1] "5 15  3 23 11 59 44.7" "-.263226218521e-03"  ".488853402202e-11"  ".000000000000e+01"

So I'm trying to use a lookahead regex to split by spaces that are followed by a dot or a minus sign:

strsplit(mystring,"(?=[-.]) +",perl=T)

or

strsplit(nraw,"(?=[-.])\\s+",perl=T)

But the regex does not match anywhere, and the original string is output.

What am I doing wrong?


Solution

  • If you want to split on spaces that are followed by a - or a ., the lookahead has to come after the part of the pattern that matches the spaces.

    strsplit(mystring, " +(?=[-.])", perl=TRUE)
    #[[1]]
    #[1] "5 15  3 23 11 59 44.7" "-.263226218521e-03"   ".488853402202e-11"   
    #[4] ".000000000000e+01"
    

    Note that it is considered good practice to use the reserved word TRUE (i.e. it can't be redefined) instead of T, which can be redefined.
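
    As a brief illustration of that point (a sketch, not part of the original answer; the names redefined and assign_failed are made up for the example), T can be rebound like any other variable, while an assignment to TRUE is rejected:

    ```r
    # local() keeps the rebinding of T from leaking into the rest of the session.
    redefined <- local({
      T <- FALSE   # allowed: T is an ordinary variable whose default value is TRUE
      T
    })
    redefined      # FALSE

    # Assigning to the reserved word TRUE raises an error instead.
    assign_failed <- inherits(
      try(eval(quote(TRUE <- FALSE)), silent = TRUE),
      "try-error"
    )
    assign_failed  # TRUE
    ```

    Code that passes perl=T would silently change behaviour if T had been rebound earlier in the session, which is why perl=TRUE is the safer spelling.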


    If for some reason you want to put the lookahead first, then you need to match both the space(s) and the [-.] inside the lookahead, and then match those same space(s) again outside of the lookahead:

    strsplit(mystring, "(?= +[-.]) +", perl=TRUE)
    #[[1]]
    #[1] "5 15  3 23 11 59 44.7" "-.263226218521e-03"    ".488853402202e-11"    
    #[4] ".000000000000e+01" 
    

    This works because the lookahead is zero-width, meaning it doesn't actually consume those characters or move forward from the initial match position. You stay right at the match's beginning, which allows you to match those same spaces again outside of the lookahead.
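
    One way to see the zero-width behaviour is to extract what the pattern actually matched. The following is a minimal sketch using base R's gregexpr and regmatches, with mystring assumed to hold the string from the question and m being a name chosen for the example:

    ```r
    mystring <- "5 15  3 23 11 59 44.7 -.263226218521e-03  .488853402202e-11  .000000000000e+01"

    # Pull out every substring matched by the lookahead-first pattern.
    m <- regmatches(mystring, gregexpr("(?= +[-.]) +", mystring, perl = TRUE))[[1]]

    # Each match consists of spaces only: the - or . that the lookahead
    # checked for is never consumed, so strsplit leaves it in the next piece.
    nchar(m)              # 1 2 2
    all(grepl("^ +$", m)) # TRUE
    ```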


    Your original approach doesn't work because the lookahead is zero-width. The pattern asks the engine to look ahead from the current position, without moving forward, to see whether the next character is a . or a -. If it is, the engine must then match one or more spaces starting at that same position. The same character cannot be both a . or - and a space, so the pattern can never match.
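
    The difference between the two orderings can be checked directly with grepl. This is a small sketch, again assuming mystring holds the string from the question:

    ```r
    mystring <- "5 15  3 23 11 59 44.7 -.263226218521e-03  .488853402202e-11  .000000000000e+01"

    # Lookahead first: the next character would have to be [-.] and a space at once.
    grepl("(?=[-.]) +", mystring, perl = TRUE)   # FALSE

    # Spaces first, then the lookahead: matches before each - or . that follows spaces.
    grepl(" +(?=[-.])", mystring, perl = TRUE)   # TRUE

    # With no match to split on, strsplit returns the input unchanged.
    identical(strsplit(mystring, "(?=[-.]) +", perl = TRUE)[[1]], mystring)   # TRUE
    ```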