I'm trying to do some pattern matching with the extract function from tidyr. I've tested my regex on a regex practice site, where the pattern seems to work, and I am using a lookbehind assertion.
I have the following sample text:
=[\"{ Key = source, Values = web,videoTag,assist }\",\"{ Key = type,
Values = attack }\",\"{ Key = team, Values = 2 }\",\"{ Key =
originalStartTimeMs, Values = 56496 }\",\"{ Key = linkId, Values =
1551292895649 }\",\"{ Key = playerJersey, Values = 8 }\",\"{ Key =
attackLocationStartX, Values = 3.9375 }\",\"{ Key =
attackLocationStartY, Values = 0.739376770538243 }\",\"{ Key =
attackLocationStartDeflected, Values = false }\",\"{ Key =
attackLocationEndX, Values = 1.7897727272727275 }\",\"{ Key =
attackLocationEndY, Values = -1.3002832861189795 }\",\"{ Key =
attackLocationEndDeflected, Values = false }\",\"{ Key = lastModified,
Values = web,videoTag,assist
I want to grab the numbers following attackLocationX (that is, all numbers following any text about an attack location).
Using the following code with lookbehind assertion, however, I get no results:
df %>%
  extract(message, "x_start", '((?<=attackLocationStartX,/sValues/s=/s)[0-9.]+)')
This function will return NA if no pattern match is found, and my target column is all NA values despite having tested the pattern on www.regexr.com. According to the documentation, R pattern matching supports lookbehind assertions, so I'm not sure what else to do here.
First of all, to match whitespace you need \s, not /s.
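To illustrate the difference, here is a minimal sketch in base R (not the asker's tidyr pipeline). The one-line msg string is a shortened stand-in for the sample text, and perl = TRUE is used so that the lookbehind is supported by regexpr:

```r
# Shortened stand-in for one entry of the sample text (an assumption for illustration)
msg <- "{ Key = attackLocationStartX, Values = 3.9375 }"

# "/s" is a literal slash followed by "s", so the lookbehind never matches
bad <- regmatches(
  msg,
  regexpr("(?<=attackLocationStartX,/sValues/s=/s)[0-9.]+", msg, perl = TRUE)
)

# "\\s" in an R string is the regex \s, which matches one whitespace character
good <- regmatches(
  msg,
  regexpr("(?<=attackLocationStartX,\\sValues\\s=\\s)[0-9.]+", msg, perl = TRUE)
)

print(bad)   # empty character vector: no match
print(good)  # the matched number as a string
```

Note that inside an R string literal the backslash itself has to be doubled, which is why the pattern is written with \\s.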
You do not have to use a lookbehind here, as extract will return captured substrings if capturing group(s) are used in the pattern.
Use
df %>%
extract(message, "x_start", "attackLocationStartX\\s*,\\s*Values\\s*=\\s*(-?\\d+\\.\\d+)")
Output: 3.9375.
The regex may also look like "attackLocationStartX\\s*,\\s*Values\\s*=\\s*(-?\\d[.0-9]*)".
As the (-?\\d+\\.\\d+) part is captured, only the text in this group will be the output.
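A self-contained sketch of the capturing-group approach follows. The one-row data frame and the y_end column name are made up for illustration, and the message is a shortened version of the sample text. remove = FALSE keeps the message column so it can be parsed a second time:

```r
library(tidyr)

# Hypothetical one-row data frame with a shortened version of the sample text
df <- data.frame(
  message = paste(
    "{ Key = attackLocationStartX, Values = 3.9375 },",
    "{ Key = attackLocationEndY, Values = -1.3002832861189795 }"
  ),
  stringsAsFactors = FALSE
)

# Only the text inside the capturing group (...) ends up in the new column
res <- tidyr::extract(
  df, message, "x_start",
  "attackLocationStartX\\s*,\\s*Values\\s*=\\s*(-?\\d+\\.\\d+)",
  remove = FALSE
)

# The optional hyphen (-?) lets the same pattern pick up negative values
res <- tidyr::extract(
  res, message, "y_end",
  "attackLocationEndY\\s*,\\s*Values\\s*=\\s*(-?\\d+\\.\\d+)",
  remove = FALSE
)

print(res[, c("x_start", "y_end")])
```

The extracted values are character strings by default; passing convert = TRUE to extract should turn them into numbers if that is what you need.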
Pattern details
(-?\d+\.\d+) - a capturing group that matches
    -? - an optional hyphen (? means 1 or 0 occurrences)
    \d+ - 1 or more digits (+ means 1 or more)
    \. - a dot
    \d+ - 1 or more digits
\d[.0-9]* - a digit (\d), followed by 0 or more dots or digits ([.0-9]*)
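One practical difference between the two number patterns is that \d+\.\d+ requires a decimal point, while \d[.0-9]* also accepts plain integers. A small sketch in base R, with test values taken from the sample text and ^ and $ anchors added only so that each value is checked as a whole:

```r
# Values taken from the sample text: two decimals and one integer
vals <- c("3.9375", "-1.3002832861189795", "56496")

# Requires digits, a dot, then digits
strict <- grepl("^-?\\d+\\.\\d+$", vals, perl = TRUE)

# Requires one digit, then any run of dots or digits
loose <- grepl("^-?\\d[.0-9]*$", vals, perl = TRUE)

print(data.frame(value = vals, strict = strict, loose = loose))
```

The looser pattern is convenient when a value such as originalStartTimeMs has no fractional part, at the cost of also accepting strings like "1..2".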