Search code examples
pythonpandasdataframecomparerow

Compare two dataframes and retrieve common row elements


I need to compare two datasets:

DF1

       Subj             1           2           3
0   Biotech   Cell culture     Bioinfo  Immunology
1   Zoology   Cell culture  Immunology         NaN
2      Math   Trigonometry     Algebra         NaN
3  Microbio        Biotech         NaN         NaN
4   Physics         Optics         NaN         NaN

DF2

       Subj             1           2           
0   Biotech       Bioinfo  Immunology         
1   Zoology    Immunology      Botany                  
2  Microbio         NaN           NaN         
3   Physics        Optics  Quantumphy
4      Math  Trigonometry         NaN         

How I want my result dataframe:

       Subj             1           2          
0   Biotech       Bioinfo  Immunology         
1   Zoology    Immunology         NaN         
2      Math  Trigonometry         NaN         
3   Physics        Optics         NaN         

I can't check row by row as the datasets are huge. The number of columns varies for both datasets, but rows are the same in number. Since the order of the row elements also vary, I can't simply use merge(). I tried compare function, but it either removes all common elements or forms a dataframe containing both. I can't seem to pick out just the common elements.


Solution

  • here is one way to do it

    Understanding: number of column varies and and values in two DF are not under same column

    # Stack both the DFs, after setting Subj as index
    # this results in changing a wide format to long format
    # concat the two DF to forma new DF
    
    df3=pd.concat([df.set_index('Subj').stack().reset_index().rename(columns={0:'val'}),
              df2.set_index('Subj').stack().reset_index().rename(columns={0:'val'})],
              ).reset_index()
    
    
    # to find the same topic under a subject if it exists in two DFs
    # the join will have duplicate rows
    
    # so find the duplicated rows for Subj and Topic (val column)
    # group the duplicated rows and aggregate to a comma separated values
    # finally split on comma to create new columns
    
    out=(df3[df3.duplicated(subset=['Subj','val'])]
     .groupby('Subj')['val']
     .agg(','.join)
     .str
     .split(',',expand=True).reset_index())
    out
    
        Subj        0             1
    0   Biotech     Bioinfo       Immunology
    1   Math        Trigonometry  None
    2   Physics     Optics        None
    3   Zoology     Immunology    None