python pandas dataframe pivot index-match

Fill matrix with value by searching for index and column names in other DataFrame

I have an "empty" data frame that looks as follows:

        6807    6809    5341
126293  nan     nan     nan
126294  nan     nan     nan     
126295  nan     nan     nan

The column names give me a name_id, whereas the index values give me a file_id. Now I want to search for each file_id and name_id pair in separate pandas data frames named pro, cont, and neutral, which look like this:

    file_id name_id
0   126293  7244
1   126293  4978
2   126293  5112
3   126293  6864

If I find the file_id and name_id in the pro data frame, I want to fill the corresponding cell of the empty data frame above with 1; if found in cont, with -1; and if found in neutral, with 0. This should give me a result like this, e.g.:

        6807    6809    5341
126293  1       -1     0
126294  0       -1     0        
126295  1       -1     1

Does someone know how to get this done?

Solution

You can reshape your 'empty' df (let's call it df) into long form with unstack() and merge it against a combination of pro, con and neu (your pro, cont and neutral). Then you can re-arrange it back into a 2D shape.

Put the votes together into one dataframe:

import pandas as pd
votes = pd.concat([pro.assign(v=1), con.assign(v=-1), neu.assign(v=0)])
# Cast only if needed: name_id must match the type of the column labels in the empty df (I have no way of knowing your actual types).
votes['name_id'] = votes['name_id'].astype(str)

votes now looks like this (numbers made up by me):

    file_id name_id v
0   126293  6807    1
1   126293  4978    1
2   126293  5112    1
3   126293  6864    1
0   126295  6809    -1
0   126294  5341    0
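
For reference, the made-up votes table above can be reproduced with a sketch like the following. The contents of pro, con and neu here are hypothetical, picked only to match that table, and the cast to str assumes the empty df has string column labels.

```python
import pandas as pd

# Hypothetical inputs, chosen only to reproduce the made-up votes table.
pro = pd.DataFrame({"file_id": [126293] * 4,
                    "name_id": [6807, 4978, 5112, 6864]})
con = pd.DataFrame({"file_id": [126295], "name_id": [6809]})
neu = pd.DataFrame({"file_id": [126294], "name_id": [5341]})

# Tag each source with its vote value, then stack them vertically.
votes = pd.concat([pro.assign(v=1), con.assign(v=-1), neu.assign(v=0)])

# Assumption: the empty df has string column labels, so cast to match.
votes["name_id"] = votes["name_id"].astype(str)

print(votes)
```

Note that concat keeps each frame's original index, which is why the index values repeat (0, 1, 2, 3, 0, 0). That does not matter for the merge, since it joins on columns.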

Now we merge it with the long-form (unstacked) df on name_id and file_id:

df1 = (df.unstack()
         .reset_index()
         .merge(votes, left_on=['level_0', 'level_1'],
                right_on=['name_id', 'file_id'], how='left')
         [['level_0', 'level_1', 'v']])

df1 looks like this:


    level_0 level_1 v
0   6807    126293  1.0
1   6807    126294  NaN
2   6807    126295  NaN
3   6809    126293  NaN
4   6809    126294  NaN
5   6809    126295  -1.0
6   5341    126293  NaN
7   5341    126294  0.0
8   5341    126295  NaN
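
The names level_0 and level_1 come from reset_index(), because neither the index nor the columns of df have names. As a small sketch of what the long form looks like before the merge (assuming string column labels and integer file_ids, and using the hypothetical names long_df and value), you could rename the columns to make the merge keys more obvious:

```python
import numpy as np
import pandas as pd

# Assumption: string column labels and an integer file_id index,
# mirroring the empty frame from the question.
df = pd.DataFrame(np.nan,
                  index=[126293, 126294, 126295],
                  columns=["6807", "6809", "5341"])

# unstack() on a flat-indexed frame yields a Series keyed by
# (column label, index label); reset_index() turns both into columns.
long_df = df.unstack().reset_index()

# Give the generated columns descriptive (hypothetical) names.
long_df.columns = ["name_id", "file_id", "value"]

print(long_df)
```

With these names, the merge could use on=['name_id', 'file_id'] instead of left_on/right_on.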

Now unstack it back into a 2D shape:

df1.set_index(['level_1','level_0']).unstack()

output:


        v
level_0 5341    6807    6809
level_1         
126293  NaN     1.0     NaN
126294  0.0     NaN     NaN
126295  NaN     NaN    -1.0

You get NaNs where there were no votes in any of pro, con or neu. Votes in those dfs for file_id/name_id pairs that were not originally present in df are ignored.
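
If you want the result to have the same layout as the original empty df (no extra v level, original column order), one possible clean-up is sketched below. It selects the v column before unstacking and then uses reindex. The sample df and votes are assumptions built from the made-up numbers in this answer, and result is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Assumed sample data, based on the made-up numbers used in this answer.
df = pd.DataFrame(np.nan,
                  index=[126293, 126294, 126295],
                  columns=["6807", "6809", "5341"])
votes = pd.DataFrame({
    "file_id": [126293, 126293, 126295, 126294],
    "name_id": ["6807", "4978", "6809", "5341"],
    "v": [1, 1, -1, 0],
})

# Long form of df, left-merged with the votes.
df1 = (df.unstack()
         .reset_index()
         .merge(votes, left_on=["level_0", "level_1"],
                right_on=["name_id", "file_id"], how="left")
         [["level_0", "level_1", "v"]])

# Selecting the "v" Series first avoids the extra column level;
# reindex restores the row and column order of the original df.
result = (df1.set_index(["level_1", "level_0"])["v"]
             .unstack()
             .reindex(index=df.index, columns=df.columns))
result.index.name = None
result.columns.name = None

print(result)
```

The vote for name_id 4978 is dropped because that column does not exist in df, and cells without any vote stay NaN.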