Search code examples
pythonpandascontains

Select rows that match exactly with string contain


I have a dataset where I try to select only the rows, which exactly match de defined strings in the list.

list  = ['P:34.', 'R:ES.'] 

df = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-03'],
    'Code':['P:34. R:ES.', 'R:ESB.', 'K2P:P:341.', 'R:ESZ', 'P:34.', 'R.ES7.', 'P 34 5', 'P:32. R:ES.'], 
    'Ratings':[9.0, 8.0, 5.0, 3.0, 2, 3, 6, 5]}) 

I used the function str.contains to select the rows accordingly but with that, I get rows with do not match the strings exactly.

sample = df[df.Code.str.contains('|'.join(list),na=False)]

I try to get only the rows that contain exactly the strings (also considering the point at the end of the strings) in the list which would be something like that:

df_exact_match = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-02', '2021-01-03'],
    'Code':['P:34. R:ES.', 'P:34.', 'P:32. R:ES.'], 
    'Ratings':[9.0, 2, 5]})

Thank you very much for your advice :)


Solution

  • I get rows with do not match the strings exactly.

    That happens because by default Series.str.contains assumes that the first argument is a regex pattern, and in regex the dot . matches any single character. To match the literal . you have to escape it (i.e. \.). There is no need to specify na=False btw.

    >>> l  = ['P:34\.', 'R:ES\.'] 
    >>> df[df.Code.str.contains('|'.join(l))]
    
             Date         Code  Ratings
    0  2021-01-01  P:34. R:ES.      9.0
    4  2021-01-02        P:34.      2.0
    7  2021-01-03  P:32. R:ES.      5.0