Tags: python, pandas, string, dataframe, string-comparison

Pandas: What is the difference between isin() and str.contains()?


I want to know whether a specific string is present in some columns of my DataFrame (a different string for each column). From what I understand, isin() is written for DataFrames but also works on Series, while str.contains() works better for Series.

I don't understand how I should choose between the two. (I searched for similar questions but didn't find any explanation on how to choose between the two.)


Solution

  • .isin checks if each value in the column is contained in a list of arbitrary values. Roughly equivalent to value in [value1, value2].

    str.contains checks if arbitrary values are contained in each value in the column. Roughly equivalent to substring in large_string.

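    In other words, both methods test each element of the column, but .isin compares the whole value against the list and is available for all data types, while str.contains looks inside each value for a substring or regex match and makes sense only when dealing with strings (or values that can be represented as strings).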

    From the official documentation:

    Series.isin(values)

    Check whether values are contained in Series. Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly.


    Series.str.contains(pat, case=True, flags=0, na=nan, regex=True)

    Test if pattern or regex is contained within a string of a Series or Index.

    Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.

    Examples:

    import pandas as pd

    df = pd.DataFrame({'a': ['aa', 'ba', 'ca']})
    print(df)
    #     a
    # 0  aa
    # 1  ba
    # 2  ca
    
    print(df[df['a'].isin(['aa', 'ca'])])
    #     a
    # 0  aa
    # 2  ca
    
    print(df[df['a'].str.contains('b')])
    #     a
    # 1  ba
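
For the case in the question (a different string for each column), a minimal sketch might look like the following. The column names, data, and search terms are made up for illustration. It also passes regex=False and na=False to str.contains, since the signature quoted above treats the pattern as a regex by default and returns missing values as NaN, which may not be what you want for a plain substring test.

```python
import pandas as pd

# Hypothetical data and search terms, made up for illustration.
df = pd.DataFrame({
    "city": ["Paris", "Lyon", None],
    "code": ["A.1", "AB1", "B.2"],
})

# isin: each whole value must equal one of the listed candidates.
# "Par" is in the list, but no cell is exactly "Par".
exact = df["city"].isin(["Paris", "Par"])

# str.contains: look for a substring inside each string.
# regex=False treats "." as a literal dot; na=False maps missing values to False.
partial = df["code"].str.contains(".", regex=False, na=False)

# A different substring per column: True if any row in that column contains it.
terms = {"city": "Par", "code": "A."}
found = {
    col: bool(df[col].str.contains(term, regex=False, na=False).any())
    for col, term in terms.items()
}

print(exact.tolist())    # [True, False, False]
print(partial.tolist())  # [True, False, True]
print(found)             # {'city': True, 'code': True}
```

If you need exact matches per column instead, the same dictionary loop works with df[col].isin([term]) in place of the str.contains call.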