Search code examples
pythonpandasdataframepivotindex-match

Fill matrix with value by searching for index and column names in other DataFrame


I have a "empty" data frame looking as follows:

        6807    6809    5341
126293  nan     nan     nan
126294  nan     nan     nan     
126295  nan     nan     nan     

The column names give me an name_id whereas the index values give me a file_id. Now I want to search for the file_id and the name_id in separate pandas data frames named pro, cont, and neutral which look like this:

    file_id name_id
0   126293  7244
1   126293  4978
2   126293  5112
3   126293  6864

If I find the file_idand name_idin the prodataframe I want to fill the empty data frame cell above with 1, when found in cont then -1 when in neutral, then the value entered into the matrix should be 0. Giving me a result like this, e.g.:

        6807    6809    5341
126293  1       -1     0
126294  0       -1     0        
126295  1       -1     1        

Does someone know how to get this done?


Solution

  • You can stack your 'empty' df (let's call it df) and merge against a combination of pro, con and neu. Then you can re-arrange it back into a 2d shape

    Put the votes together into one dataframe:

    votes = pd.concat([pro.assign(v=1), con.assign(v=-1), neu.assign(v=0)])
    votes['name_id'] = votes['name_id'].astype(str) # you may or may not have to do this depending on what type your actual df is, as I have no way of knowing. It should match the type from columns in the empty df
    

    votes now look like this (made up numbers by me):

        file_id name_id v
    0   126293  6807    1
    1   126293  4978    1
    2   126293  5112    1
    3   126293  6864    1
    0   126295  6809    -1
    0   126294  5341    0
    

    Now we merge it to a stacked df on name_id and file_id:

    df1  = (df.unstack()
                .reset_index()
                .merge(votes, left_on = ['level_0','level_1'], 
                    right_on = [ 'name_id','file_id'], how='left')[['level_0', 'level_1', 'v']]
    )
    

    df1 looks like

    
        level_0 level_1 v
    0   6807    126293  1.0
    1   6807    126294  NaN
    2   6807    126295  NaN
    3   6809    126293  NaN
    4   6809    126294  NaN
    5   6809    126295  -1.0
    6   5341    126293  NaN
    7   5341    126294  0.0
    8   5341    126295  NaN
    

    Now unstack it back

    df1.set_index(['level_1','level_0']).unstack()
    

    output:

    
            v
    level_0 5341    6807    6809
    level_1         
    126293  NaN     1.0     NaN
    126294  0.0     NaN     NaN
    126295  NaN     NaN    -1.0
    

    You get NaNs where you had no votes in either pro con or neu. The votes in those dfs that are for file_id/name_id not originally present in df are ignored