I am trying to extract a set of keywords such as ['lemon', 'apple', 'coconut'] from paths such as "/var/prj/lemon_123/xyz", "/var/prj/123_apple/coconut", "/var/prj/lemonade/coconutapple", "/var/prj/apple/lemon".
The expected output is a little complex:
Paths | MatchedKeywords |
---|---|
"/var/prj/lemon_123/xyz" | lemon |
"/var/prj/123_apple/coconut" | apple, coconut |
"/var/prj/lemonade/coconutapple" | |
"/var/prj/apple/lemon" | apple, lemon |
Keep in mind that the third row has no keyword that stands on its own, i.e. one delimited by /, \s, \d or _, which is why there is no match. The regular expression is kind of like this: \s\d_/[\s\d_/]. I tried using:
df['Paths'].str.findall(r'[^\s\d_/]lemon|apple|coconut[\s\d_/$]', flags=re.IGNORECASE)
But it is still showing 'lemon' and 'coconut' in the third row.
Thank you in advance.
You can use either of the following:
df['Paths'].str.findall(r'(?<![^\W_])(?:lemon|apple|coconut)(?![^\W_])').str.join(", ")
df['Paths'].str.findall(r'(?<![^\W\d_])(?:lemon|apple|coconut)(?![^\W\d_])').str.join(", ")
See the regex demo (and regex demo #2). The regex matches:
(?<![^\W_]) - a location that is not immediately preceded by a char other than a non-word char and an underscore, i.e. by a letter or digit (it is a left-hand word boundary with the _ subtracted from it)
(?:lemon|apple|coconut) - a non-capturing group matching any of the words inside the group
(?![^\W_]) - a location that is not immediately followed by a char other than a non-word char and an underscore, i.e. by a letter or digit (it is a right-hand word boundary with the _ subtracted from it).
If you use (?<![^\W\d_]) and (?![^\W\d_]), your word boundaries will be letter boundaries, i.e. \b with digits and underscore subtracted from it.
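To make the difference between the two boundary variants concrete, here is a minimal sketch using only the standard re module. The sample strings and the variable names (alnum_bounded, letter_bounded) are made up for illustration and are not part of the question.

```python
import re

KEYWORDS = r"(?:lemon|apple|coconut)"

# Variant 1: a keyword may not touch a letter or a digit.
alnum_bounded = re.compile(r"(?<![^\W_])" + KEYWORDS + r"(?![^\W_])")
# Variant 2: a keyword may not touch a letter; adjacent digits are allowed.
letter_bounded = re.compile(r"(?<![^\W\d_])" + KEYWORDS + r"(?![^\W\d_])")

# Hypothetical inputs chosen to show where the two variants disagree.
samples = ["/var/prj/lemon_123", "/var/prj/lemon123", "/var/prj/lemonade"]

results = {s: (alnum_bounded.findall(s), letter_bounded.findall(s)) for s in samples}
for path, (alnum_hits, letter_hits) in results.items():
    print(path, alnum_hits, letter_hits)
# /var/prj/lemon_123 ['lemon'] ['lemon']
# /var/prj/lemon123 [] ['lemon']
# /var/prj/lemonade [] []
```

Only the second sample differs: the digit right after lemon blocks the first variant but not the second.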
See the Python demo:
import pandas as pd
df = pd.DataFrame({"Paths":["/var/prj/lemon_123/xyz", "/var/prj/123_apple/coconut", "/var/prj/lemonade/coconutapple", "/var/prj/apple/lemon"]})
df['Paths'].str.findall(r'(?<![^\W_])(?:lemon|apple|coconut)(?![^\W_])').str.join(", ")
# 0 lemon
# 1 apple, coconut
# 2
# 3 apple, lemon
# Name: Paths, dtype: object
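If the keyword list changes often, the pattern can be built from the list instead of being typed by hand. The following is only a sketch: build_keyword_pattern is a hypothetical helper name, and the mixed-case path is an invented example to show the effect of re.IGNORECASE, which the question also used.

```python
import re

def build_keyword_pattern(keywords):
    """Return a regex string matching any keyword not adjacent to a letter or digit."""
    # re.escape keeps characters such as '.' or '+' in a keyword literal.
    alternatives = "|".join(re.escape(k) for k in keywords)
    return rf"(?<![^\W_])(?:{alternatives})(?![^\W_])"

pattern = build_keyword_pattern(["lemon", "apple", "coconut"])

paths = [
    "/var/prj/lemon_123/xyz",
    "/var/prj/123_apple/coconut",
    "/var/prj/lemonade/coconutapple",
    "/var/prj/apple/lemon",
]
matched = [", ".join(re.findall(pattern, p, flags=re.IGNORECASE)) for p in paths]
print(matched)
# ['lemon', 'apple, coconut', '', 'apple, lemon']

# Invented mixed-case path: matches keep the casing found in the input.
mixed_case = re.findall(pattern, "/var/prj/Lemon_1/APPLE", flags=re.IGNORECASE)
print(mixed_case)
# ['Lemon', 'APPLE']
```

The same pattern string can be passed to df['Paths'].str.findall(pattern, flags=re.IGNORECASE) in place of the hard-coded one.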