python pandas filter keyword-search

How to filter strings if the first four sentences contain keywords


I have a pandas DataFrame called df with a column called article. The article column contains 600 strings, each of which represents a news article. I want to KEEP only those articles whose first four sentences contain the keywords "COVID-19" AND ("China" OR "Chinese"), but I'm unable to find a way to do this on my own.

(In each string, sentences are separated by \n. An example article looks like this:)

\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission.\ .......

Solution

  • First we define a function to return a boolean based on whether your keywords appear in a given sentence:

    def contains_covid_kwds(sentence):
        # True if 'COVID-19' appears together with 'China' or 'Chinese'
        kw1 = 'COVID-19'
        kw2 = 'China'
        kw3 = 'Chinese'
        return kw1 in sentence and (kw2 in sentence or kw3 in sentence)
    

    Then we create a boolean series by applying this function (using Series.apply) to the sentences of your df.article column.

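    Note that we use a lambda function to truncate the string passed to contains_covid_kwds at the fifth occurrence of '\n'. Because each article starts with '\n', this keeps exactly your first four sentences. It works because s.replace('\n', '#', 4) swaps the first four '\n' for a placeholder of the same length, so find('\n') then returns the index of the fifth one. Be aware that if an article has fewer than five '\n', find returns -1 and the slice only drops the last character: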

    series = df.article.apply(lambda s: contains_covid_kwds(s[:s.replace('\n', '#', 4).find('\n')]))
    

    Then we pass the boolean series to df.loc to select the rows where the series is True:

    filtered_df = df.loc[series]
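
The steps above can also be sketched end to end with str.split instead of the replace/find trick, which may be easier to read and also copes with articles shorter than four sentences. This is only a minimal sketch: the helper first_n_sentences and the three sample articles are hypothetical and stand in for your real df.

```python
import pandas as pd


def first_n_sentences(text, n=4):
    """Return the first n non-empty newline-separated sentences, joined by spaces."""
    sentences = [s for s in text.split('\n') if s.strip()]
    return ' '.join(sentences[:n])


def contains_covid_kwds(text):
    """True if text mentions 'COVID-19' and either 'China' or 'Chinese'."""
    return 'COVID-19' in text and ('China' in text or 'Chinese' in text)


# Hypothetical sample data standing in for the real 600-row df
df = pd.DataFrame({'article': [
    '\nChina may be past the worst of the COVID-19 pandemic.\nSecond.\nThird.\nFourth.\nFifth.',
    '\nA local story.\nSecond.\nThird.\nFourth.\nCOVID-19 cases rose in China.',
    '\nChinese officials spoke.\nCOVID-19 tests are required.',
]})

# Boolean mask: do the first four sentences contain the keywords?
mask = df['article'].apply(lambda s: contains_covid_kwds(first_n_sentences(s)))
filtered_df = df.loc[mask]
```

In this sample, the second article is dropped because its keywords only appear in the fifth sentence, while the third article is kept even though it has just two sentences.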