Search code examples
pythonstringtextseries

Check if a pandas.core.series.Series contains a specific string in Python


I have a dataframe (df1) that looks like this:

first_name last_name affiliation
jean dulac University of Texas
peter huta University of Maryland

I want to match this dataframe to another one that contains several potential matches. Each potential match has a first and last name and also a list of all the affiliations this person was associated with, and I want to use the information in this affiliation column to differentiate between my potential matches and keep only the most likely one.

The second dataframe has the following form:

first_name last_name affiliations_all
jean dulac [{'city_name': 'Kyoto', 'country_name': 'Japan', 'name': 'Kyoto University'}]
jean dulac [{'city_name': 'Texas', 'country_name': 'USA', 'name': 'University of Texas'}]

The column affiliations_all is apparently saved as a pandas.core.series.Series (and I can't change that since it comes from an API query).

I am thinking that one way to match the 2 dataframes would be to remove the words like "university" and "of" from the affiliation column of the first dataframe (that's easy), do the same for the affiliations_all column of the second dataframe (don't know how to do that) and then run some version of

test.apply(lambda x: str(x.affiliation) in str(x.affiliations_all), axis=1)

adapted to the fact that affiliations_all is a series.Series.

Any idea how to do that?

Thanks!


Solution

  • One possible solution would be to transform df2 (expand the columns) and then merge df1 with df2:

    # transform df2
    df2 = df2.explode("affiliations_all")
    df2 = pd.concat([df2, df2.pop("affiliations_all").apply(pd.Series)], axis=1)
    df2 = df2.rename(columns={"name": "affiliation"})
    
    print(df2)
    

    This prints:

      first_name last_name city_name country_name          affiliation
    0       jean     dulac     Kyoto        Japan     Kyoto University
    1       jean     dulac     Texas          USA  University of Texas
    

    And the seconds step will be merge df1 with transformed df2:

    df_out = pd.merge(df1, df2, on=["first_name", "last_name", "affiliation"])
    
    print(df_out)
    

    Prints:

      first_name last_name          affiliation city_name country_name
    0       jean     dulac  University of Texas     Texas          USA