I want to know if a specific string is present in some columns of my dataframe (a different string for each column).
From what I understand, isin() is written for DataFrames but can work for Series as well, while str.contains() works better for Series.
I don't understand how I should choose between the two. (I searched for similar questions but didn't find any explanation on how to choose between the two.)
.isin checks if each value in the column is contained in a list of arbitrary values. Roughly equivalent to value in [value1, value2].
str.contains checks if arbitrary values are contained in each value in the column. Roughly equivalent to substring in large_string.
In other words, .isin compares each whole value against a collection of candidates and is available for all data types. str.contains searches inside each value for a substring (or regex pattern) and makes sense only when dealing with strings (or values that can be represented as strings).
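To make the distinction concrete, here is a minimal sketch that runs both methods on the same Series. The sample data and variable names are made up for illustration and are not part of the original question.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
fruits = pd.Series(['apple', 'pineapple', 'pear'])

# isin: True only where the whole value equals one of the candidates.
exact = fruits.isin(['apple', 'pear'])

# str.contains: True wherever the substring occurs inside the value.
# regex=False treats 'apple' as a literal string rather than a pattern.
partial = fruits.str.contains('apple', regex=False)

# isin is not limited to strings; here it is applied to integers.
numbers = pd.Series([1, 2, 3])
in_set = numbers.isin([1, 3])

print(exact.tolist())
print(partial.tolist())
print(in_set.tolist())
```

Note that 'pineapple' is matched by str.contains('apple') but not by isin(['apple', 'pear']), because isin requires the whole value to be equal.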
From the official documentation:
Series.isin(values)
Check whether values are contained in Series. Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly.
Series.str.contains(pat, case=True, flags=0, na=nan, regex=True)
Test if pattern or regex is contained within a string of a Series or Index.
Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.
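Since regex=True is the default in the signature above, the pattern is interpreted as a regular expression unless told otherwise. The sketch below, using made-up data, shows how the regex, case and na parameters change the result; na=False is passed explicitly so that missing values come out as False.

```python
import pandas as pd

# Hypothetical data: one value with a literal dot, one without, one missing.
s = pd.Series(['a.b', 'axb', None])

# Default regex=True: '.' matches any character, so 'axb' also matches.
as_regex = s.str.contains('a.b', na=False)

# regex=False: the pattern is a plain substring, so only 'a.b' matches.
as_literal = s.str.contains('a.b', regex=False, na=False)

# case=False: the comparison ignores upper/lower case.
ignore_case = s.str.contains('A.B', case=False, regex=False, na=False)

print(as_regex.tolist())
print(as_literal.tolist())
print(ignore_case.tolist())
```

If the string you are looking for may contain characters such as '.', '(' or '+', passing regex=False is usually the safer choice.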
Examples:
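import pandas as pd

df = pd.DataFrame({'a': ['aa', 'ba', 'ca']})
print(df)
#     a
# 0  aa
# 1  ba
# 2  ca

# isin: keep rows whose value exactly equals one of the listed values
print(df[df['a'].isin(['aa', 'ca'])])
#     a
# 0  aa
# 2  ca

# str.contains: keep rows whose value contains the substring 'b'
print(df[df['a'].str.contains('b')])
#     a
# 1  ba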
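
For the case in the question (a different string for each column), one possible approach is to keep the search strings in a dict keyed by column name. This is only a sketch: the DataFrame, the column names and the needles dict are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame and search strings, for illustration only.
df = pd.DataFrame({
    'city': ['Paris', 'New York', 'Parma'],
    'fruit': ['apple', 'pineapple', 'pear'],
    'n': [1, 2, 3],
})
needles = {'city': 'Par', 'fruit': 'apple'}

# Substring search: one str.contains call per column, each with its own string.
contains_mask = pd.DataFrame({
    col: df[col].str.contains(sub, regex=False, na=False)
    for col, sub in needles.items()
})

# Per column: does the string occur anywhere in that column?
found = {col: bool(flag) for col, flag in contains_mask.any().items()}

# Exact match: DataFrame.isin accepts a dict mapping column names to values.
# Columns that are not listed in the dict come back as all False.
exact_mask = df.isin({'city': ['Paris'], 'fruit': ['apple']})

print(contains_mask)
print(found)
print(exact_mask)
```

Use the str.contains variant when the string may be only part of a cell, and the isin variant when the whole cell must equal it.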