python pandas substring

How can I isolate a particular word/words in a string using Python?


I have a dataframe where, in one of the columns, I only want to keep a subset of each string. In the example below I only want to keep the people's names.

**Example:**

column1
1.Joe Smith, NYC(212)
2.Jane Doe, HOU(713)

To remove everything to the left of the name, I used `df['column1'] = df['column1'].str.lstrip("0123456789.")`

This worked. What I can't figure out is how to isolate the name by dropping everything from the comma onward. Would a regex be better suited here?

Thanks!


Solution

  • Try using a regex to extract the names:

    df['column1'].str.extract(r'\d+\.(.+?),')
    

    Output:

    0   Joe Smith
    1   Jane Doe
    

    More details on the pattern:

    • \d+: Match one or more digits.
    • \.: Match a period (dot) character.
    • (.+?): Capture one or more characters (non-greedy) into a group.
    • ,: Match a comma character.
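
For reference, here is a minimal self-contained sketch of the pattern applied to the sample data from the question. The variable names `names` and `names_alt` are only illustrative, and the frame is rebuilt by hand from the two example rows. `expand=False` is an addition here so that the result is a Series that can be assigned straight back to the column. The second approach is a non-regex alternative that builds on the `lstrip` call from the question; it assumes every row has the same "number, dot, name, comma" shape.

```python
import pandas as pd

# Sample data rebuilt from the question (two rows only).
df = pd.DataFrame({"column1": ["1.Joe Smith, NYC(212)", "2.Jane Doe, HOU(713)"]})

# Regex approach: with a single capture group, expand=False returns a Series.
names = df["column1"].str.extract(r"\d+\.(.+?),", expand=False)

# Non-regex alternative: strip the leading digits and dot,
# then keep only the text before the first comma.
names_alt = df["column1"].str.lstrip("0123456789.").str.split(",").str[0]

# Write the cleaned names back to the column.
df["column1"] = names
print(df)
```

Rows that do not match the pattern come back as `NaN` from `str.extract`, so it may be worth checking for missing values before overwriting the column on real data.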