How to extract exact words from a string using Python and re?


The data sample is:

a=pd.DataFrame({'Strings':['i xxx iwantto iii i xxx i',
                           'and you xxx and x you xxxxxx and you and you']})
b=['i','and you']

There are two words (phrases) in b, and I want to find them in a as exact words rather than as substrings. The result I want is:

['i' ,'i' ,'i']
['and you',' and you ',' and you']

What I actually need is the number of times each word occurs in a string, so I do not need the lists themselves; I show them only to make clear that I want exact-word matches. Here is my attempt:

s='r\'^'+b[0]+' | '+b[0]+' | '+b[0]+'$\''
len(re.findall(s,a.loc[0,'Strings']))

I hoped s would match the word at the beginning, in the middle, and at the end of the string. Both a and b are large, so I cannot hard-code the patterns. But the result is:

len(re.findall(s,a.loc[0,'Strings']))
Out[110]: 1
re.findall(s,a.loc[0,'Strings'])
Out[111]: [' i ']

It looks like only the middle occurrence is matched. I am not sure where I went wrong.


Solution

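  • import pandas as pd
    a=pd.DataFrame({'Strings':['i xxx iwantto iii i xxx i',
                               'and you xxx and x you xxxxxx and you and you']})
    # Match 'i' followed by a space, or the phrase 'and you'.
    # Note: 'i ' also matches the last 'i' of 'iii ' and misses a final 'i' with no trailing space.
    print(a.Strings.str.findall('i |and you'))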
    

    Output

    0                   [i , i , i ]
    1    [and you, and you, and you]
    Name: Strings, dtype: object
    

    # Build the same pattern from the list b defined in the question
    print(a.Strings.str.findall('{} |{}'.format(*b)))
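
For the original question of why only the middle match was found, and for counting whole words only, the following is a minimal sketch rather than a definitive answer. It uses plain `re` on a list holding the same two strings as the question's data. The helper name `exact_pattern` is hypothetical, and it assumes an "exact word" means a phrase bounded by whitespace or by the start or end of the string.

```python
import re

# Same two strings as in the question's DataFrame column
strings = ['i xxx iwantto iii i xxx i',
           'and you xxx and x you xxxxxx and you and you']
b = ['i', 'and you']

# The attempt from the question: the r prefix and the quotes end up INSIDE the pattern.
s = 'r\'^' + b[0] + ' | ' + b[0] + ' | ' + b[0] + '$\''
# s is the text  r'^i | i | i$'  so the first branch needs a literal "r'" before
# the start anchor and the last branch needs a "'" after the end anchor.
# Neither can ever match, which leaves only the middle branch ' i '.

def exact_pattern(phrase):
    """Hypothetical helper: match phrase only when whitespace-delimited."""
    # (?<!\S) : not preceded by a non-whitespace character
    # (?!\S)  : not followed by a non-whitespace character
    # re.escape guards against regex metacharacters inside the phrase
    return r'(?<!\S)' + re.escape(phrase) + r'(?!\S)'

# Count whole-word occurrences of every phrase in every string
counts = {phrase: [len(re.findall(exact_pattern(phrase), text)) for text in strings]
          for phrase in b}
print(counts)  # {'i': [3, 0], 'and you': [0, 3]}
```

With the DataFrame from the question, the same pattern can be passed to `a['Strings'].str.count(exact_pattern(phrase))` to get one count per row. The lookarounds are used instead of `\b` so that phrases starting or ending with punctuation still behave as expected; if the phrases are always plain words, `\b` on both sides should work as well.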