
Regex in R lookbehind assertion


I'm trying to do some pattern matching with the extract function from tidyr. I've tested my regex in a regex practice site, the pattern seems to work, and I am using a lookbehind assertion.

I have the following sample text:

=[\"{ Key = source, Values = web,videoTag,assist }\",\"{ Key = type, 
Values = attack }\",\"{ Key = team, Values = 2 }\",\"{ Key = 
originalStartTimeMs, Values = 56496 }\",\"{ Key = linkId, Values = 
1551292895649 }\",\"{ Key = playerJersey, Values = 8 }\",\"{ Key = 
attackLocationStartX, Values = 3.9375 }\",\"{ Key = 
attackLocationStartY, Values = 0.739376770538243 }\",\"{ Key = 
attackLocationStartDeflected, Values = false }\",\"{ Key = 
attackLocationEndX, Values = 1.7897727272727275 }\",\"{ Key = 
attackLocationEndY, Values = -1.3002832861189795 }\",\"{ Key = 
attackLocationEndDeflected, Values = false }\",\"{ Key = lastModified, 
Values = web,videoTag,assist 

I want to grab the numbers following attackLocationX (that is, the number that follows each attack location key).

Using the following code with lookbehind assertion, however, I get no results:

df %>% 
  extract(message, "x_start", '((?<=attackLocationStartX,/sValues/s=/s)[0-9.]+)')

This function will return NA if no pattern match is found, and my target column is all NA values despite having tested the pattern on www.regexr.com. According to the documentation, R pattern matching supports lookbehind assertions so I'm not sure what else to do here.


Solution

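  • First of all, to match whitespace you need \s, not /s. The sequence /s is just a literal slash followed by the letter s, so it never matches the spaces in your text. Inside an R string literal the backslash must be doubled, so the pattern is written "\\s".

    You do not have to use a lookbehind here, because extract returns the captured substrings when the pattern contains capturing group(s).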

    Use

    df %>% 
      extract(message, "x_start", "attackLocationStartX\\s*,\\s*Values\\s*=\\s*(-?\\d+\\.\\d+)")
    

    Output: 3.9375.

    The regex may also look like "attackLocationStartX\\s*,\\s*Values\\s*=\\s*(-?\\d[.0-9]*)".

    As the (-?\\d+\\.\\d+) part is captured, only the text in this group will be the output.
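To check the two points above without tidyr, here is a minimal base R sketch. The `msg` string is a hypothetical one-line sample shortened from the message in the question, and `regexec`/`regmatches` are used only to illustrate how a capturing group isolates the number; this is not a claim about how `extract` is implemented internally.

```r
# Hypothetical one-line sample, shortened from the message in the question
msg <- paste0(
  "{ Key = playerJersey, Values = 8 },",
  "{ Key = attackLocationStartX, Values = 3.9375 },",
  "{ Key = attackLocationStartY, Values = 0.739376770538243 }"
)

# Pattern from the question: "/s" is a literal slash plus "s", so it cannot match
bad <- "(?<=attackLocationStartX,/sValues/s=/s)[0-9.]+"
bad_found <- grepl(bad, msg, perl = TRUE)

# Same lookbehind with "\\s" (the regex token \s written as an R string)
fixed <- "(?<=attackLocationStartX,\\sValues\\s=\\s)[0-9.]+"
lookbehind_hit <- regmatches(msg, regexpr(fixed, msg, perl = TRUE))

# Capturing-group pattern from the answer: regexec() returns the full match
# first, followed by one element per capturing group
pat <- "attackLocationStartX\\s*,\\s*Values\\s*=\\s*(-?\\d+\\.\\d+)"
m <- regmatches(msg, regexec(pat, msg, perl = TRUE))
group_hit <- m[[1]][2]

print(bad_found)       # FALSE
print(lookbehind_hit)  # "3.9375"
print(group_hit)       # "3.9375"
```

Both corrected patterns find the same number, so the choice between a lookbehind and a capturing group is mostly one of readability. The capturing-group form also tolerates variable whitespace, because `\\s*` is not restricted to a fixed length the way a lookbehind is.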

    Pattern details

    • (-?\d+\.\d+) - a capturing group that matches:
      • -? - an optional hyphen (? means 1 or 0 occurrences)
      • \d+ - 1 or more digits (+ means 1 or more)
      • \. - a dot
      • \d+ - 1 or more digits
    • \d[.0-9]* - a digit (\d), followed by 0 or more dots or digits ([.0-9]*)
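The question also asks for the number after every attack location key, not just attackLocationStartX. One possible approach, sketched here in base R, is to match all X/Y keys at once and use `\\K` (a PCRE feature, hence `perl = TRUE`) to drop the key text from each match. The `msg` string is again a hypothetical one-line sample, and the names given to `coords` are illustrative, not taken from the original data.

```r
# Hypothetical one-line sample built from the keys shown in the question
msg <- paste0(
  "{ Key = attackLocationStartX, Values = 3.9375 },",
  "{ Key = attackLocationStartY, Values = 0.739376770538243 },",
  "{ Key = attackLocationStartDeflected, Values = false },",
  "{ Key = attackLocationEndX, Values = 1.7897727272727275 },",
  "{ Key = attackLocationEndY, Values = -1.3002832861189795 }"
)

# [XY] restricts the match to coordinate keys, so the "Deflected" keys are skipped.
# \\K discards everything matched so far, leaving only the number in the result.
pat <- "attackLocation(?:Start|End)[XY]\\s*,\\s*Values\\s*=\\s*\\K-?\\d[.0-9]*"
hits <- regmatches(msg, gregexpr(pat, msg, perl = TRUE))[[1]]

# Convert to numeric and label the values (illustrative names)
coords <- as.numeric(hits)
names(coords) <- c("x_start", "y_start", "x_end", "y_end")
print(coords)
```

With tidyr, a comparable result could be obtained by calling `extract` once per key, or by passing several names to `into` together with a regex that contains one capturing group per name.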