Search code examples
pythonstringlistdataframecontains

str.contains() when filtering a df with a list, but some of the list items contain 2 or more words


I am filtering a column using a list and I have been using

str.contains("".format("|".join(towns))

This works on towns like "Atlanta", but not "New York" as it is searching for New and York seperately. Is there a way around this?

Reproducible example - They all return True:

array = ["New Jersey", "Atlanta", "New York", "Washington"]
df = pd.DataFrame({"col1": array})

towns = ["Atlanta", "New York"]

df["col1"].str.contains("".format("|".join(towns)))

Solution

  • For your example data Series.isin works.

    >>> df["col1"].isin(towns)
    0    False
    1     True
    2     True
    3    False
    Name: col1, dtype: bool
    

    If The Series is a bit different and you need to use a regular expression:

    >>> dg = pd.DataFrame({"col1": ["New Jersey","Atlanta","New York",
                                    "Washington", "The New York Times"]})
    >>> dg
                     col1
    0          New Jersey
    1             Atlanta
    2            New York
    3          Washington
    4  The New York Times
    >>>
    >>> rex = "|".join(towns)
    >>> dg['col1'].str.contains(rex)
    0    False
    1     True
    2     True
    3    False
    4     True
    Name: col1, dtype: bool
    

    >>> df
             col1
    0  New Jersey
    1     Atlanta
    2    New York
    3  Washington