str.contains doesn't find partial matches


In a dataframe

df = pd.DataFrame({'colA': ['id1', 'id2', 'id3', 'id4', 'id5'],
                   'colB': ['Black cat', 'Black mouse', 'Black_A cat', 'Black cat', 'White_A mouse']})

I want to find all the rows where colB contains Black cat. My command

df[df['colB'].str.contains('Black cat', na=False)]

returns only

    colA    colB
0   id1 Black cat
3   id4 Black cat

while I expect this:

    colA    colB
0   id1 Black cat
2   id3 Black_A cat
3   id4 Black cat

Why doesn't the partial match find Black_A cat?


Solution

  • What counts as a partial match in your case? str.contains checks whether the pattern occurs as a contiguous substring (the pattern is treated as a regular expression by default), so Black_A cat doesn't match Black cat. If you expect optional characters between Black and cat, you have to say so in the pattern:

    df[df['colB'].str.contains('Black.*cat', na=False)]
    #                                ^ this
    

    Output:

      colA         colB
    0  id1    Black cat
    2  id3  Black_A cat
    3  id4    Black cat
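
    To make the difference concrete, here is a minimal sketch that runs three searches on the same frame: a literal substring search, the Black.*cat regex from above, and a tighter regex. The tighter pattern Black\S* cat and the variable names literal, loose and tight are my own assumptions about what you might want, not something from the question. Note that .* accepts anything between Black and cat, so a hypothetical value like Black dog and cat would match too.

    ```python
    import pandas as pd

    df = pd.DataFrame({
        'colA': ['id1', 'id2', 'id3', 'id4', 'id5'],
        'colB': ['Black cat', 'Black mouse', 'Black_A cat', 'Black cat', 'White_A mouse'],
    })

    # Literal substring search (regex disabled): the text "Black cat" must
    # appear exactly as written, so "Black_A cat" is not matched.
    literal = df[df['colB'].str.contains('Black cat', regex=False, na=False)]

    # Regex search: "Black", then any characters (possibly none), then "cat".
    loose = df[df['colB'].str.contains(r'Black.*cat', na=False)]

    # Tighter regex (assumed intent): only non-space characters are allowed
    # between "Black" and the space that precedes "cat".
    tight = df[df['colB'].str.contains(r'Black\S* cat', na=False)]

    print(literal['colA'].tolist())
    print(loose['colA'].tolist())
    print(tight['colA'].tolist())
    ```

    On this sample data the loose and tight patterns select the same rows; they only differ once the column contains values with extra words between Black and cat.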