Tags: python, regex, pandas, dataframe, findall

Pandas extract information which starts with [\s\d_/] and ends in [\s\d_/]


I am trying to extract a set of keywords such as ['lemon', 'apple', 'coconut'] from paths such as "/var/prj/lemon_123/xyz", "/var/prj/123_apple/coconut", "/var/prj/lemonade/coconutapple", "/var/prj/apple/lemon".

The expected output is a little complex:

Paths                               MatchedKeywords
"/var/prj/lemon_123/xyz"            lemon
"/var/prj/123_apple/coconut"        apple, coconut
"/var/prj/lemonade/coconutapple"
"/var/prj/apple/lemon"              apple, lemon

Keep in mind that the third row has no match, because none of the keywords there stands as an exact word delimited by /, \s, \d or _. The pattern I have in mind is roughly a keyword that starts after [\s\d_/] and ends before [\s\d_/]. I tried using:

df['Paths'].str.findall(r'[^\s\d_/]lemon|apple|coconut[\s\d_/$]', flags=re.IGNORECASE)

But it is still showing 'lemon' and 'coconut' in the third row.

Thank you in advance.


Solution

  • You can use either of

    df['Paths'].str.findall(r'(?<![^\W_])(?:lemon|apple|coconut)(?![^\W_])').str.join(", ")
    df['Paths'].str.findall(r'(?<![^\W\d_])(?:lemon|apple|coconut)(?![^\W\d_])').str.join(", ")
    

    The first regex matches:

    • (?<![^\W_]) - a location that is not immediately preceded by a letter or digit ([^\W_] matches any word char except _), i.e. a left-hand word boundary with _ subtracted from it
    • (?:lemon|apple|coconut) - a non-capturing group matching any of the words inside the group
    • (?![^\W_]) - a location that is not immediately followed by a letter or digit, i.e. a right-hand word boundary with _ subtracted from it.
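To see the two lookarounds at work outside pandas, here is a minimal sketch using the standard `re` module on the four paths from the question. The names `KEYWORD_RE` and `find_keywords` are made up for this illustration; only the pattern itself comes from the answer.

```python
import re

# Hypothetical names for illustration; the pattern is the first one from the answer.
KEYWORD_RE = re.compile(r'(?<![^\W_])(?:lemon|apple|coconut)(?![^\W_])')

def find_keywords(path):
    """Return the keywords that do not touch a letter or digit on either side."""
    return KEYWORD_RE.findall(path)

paths = [
    "/var/prj/lemon_123/xyz",
    "/var/prj/123_apple/coconut",
    "/var/prj/lemonade/coconutapple",
    "/var/prj/apple/lemon",
]

# Map each path to the list of keywords found in it.
results = {path: find_keywords(path) for path in paths}
for path, found in results.items():
    print(path, found)
```

In the third path, "lemon" and "coconut" are each followed by a letter and "apple" is preceded by one, so the lookarounds reject all three candidates.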

    If you use (?<![^\W\d_]) and (?![^\W\d_]) instead, the boundaries become letter boundaries, i.e. \b with both digits and the underscore subtracted from it, so a keyword directly next to a digit still matches. See the Python demo:

    import pandas as pd
    df = pd.DataFrame({"Paths":["/var/prj/lemon_123/xyz", "/var/prj/123_apple/coconut", "/var/prj/lemonade/coconutapple", "/var/prj/apple/lemon"]})
    df['Paths'].str.findall(r'(?<![^\W_])(?:lemon|apple|coconut)(?![^\W_])').str.join(", ")
    #  0             lemon
    #  1    apple, coconut
    #  2                  
    #  3      apple, lemon
    #  Name: Paths, dtype: object
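The two variants only differ when a keyword touches a digit directly, which none of the sample paths do. The sketch below uses a made-up path, "/var/prj/123apple/lemon2", to show the difference, and builds the alternation from a keyword list with `re.escape`. The helper `build_pattern` and its `digits_are_boundaries` flag are hypothetical names, not part of pandas or `re`.

```python
import re

def build_pattern(keywords, digits_are_boundaries=False):
    """Build the lookaround pattern from the answer for a list of keywords."""
    # [^\W_] matches a letter or digit; [^\W\d_] matches a letter only.
    inner = r'[^\W\d_]' if digits_are_boundaries else r'[^\W_]'
    # Escape each keyword so regex metacharacters in it are taken literally.
    words = "|".join(re.escape(k) for k in keywords)
    return rf'(?<!{inner})(?:{words})(?!{inner})'

keywords = ["lemon", "apple", "coconut"]
sample = "/var/prj/123apple/lemon2"  # made-up path with digits touching keywords

# Variant 1: digits count as part of a word, so neither keyword matches.
strict = re.findall(build_pattern(keywords), sample)
# Variant 2: digits act as boundaries, so both keywords match.
loose = re.findall(build_pattern(keywords, digits_are_boundaries=True), sample)

print(strict)
print(loose)
```

The string returned by `build_pattern` can be passed to `df['Paths'].str.findall(...)` unchanged. If case-insensitive matching is needed, as in the question's attempt, `str.findall` also accepts `flags=re.IGNORECASE`.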