Search code examples
pythonpandasdataframenumpydata-cleaning

Find the categories that frequently occur together based on another column


Consider that I have the following data in a Pandas dataframe:

Paper ID Author ID
Paper_1 Author_1
Paper_1 Author_2
Paper_2 Author_2
Paper_3 Author_1
Paper_3 Author_2
Paper_3 Author_3
Paper_4 Author_1
Paper_4 Author_3

I need to find the number of non-zero collaborations. So, the output should be:
(Author_1,Author_2) --> 2
(Author_1,Author_3) --> 1

Any help or advice will be greatly appreciated.


Solution

  • If the data is fairly small, then merging on the Paper ID will generate pairs that can be collapsed/aggregated:

    # assume df has columns Paper ID, Author ID
    df_merged = df.merge(df, on="Paper ID")
    
    # keep only one instance of a collaboration
    mask = df_merged["Author ID_x"] > df_merged["Author ID_y"]
    
    # aggregate (note the use of the mask to avoid double-
    # counting and self-collaborations as noted in the
    # comment by Riccardo Bucco)
    counts = (
        df_merged[mask]
        .groupby(["Author ID_x", "Author ID_y"])
        .agg(collaboration_count=("Paper ID", "count"))
    )