Search code examples
pythonloopsdictionarydataframeranking

Combining dataframes in Python to a dictionary using one of the dataframes as key


I have 3 dataframes, containing daily data: unique code, names, scores. First value in Row 1 is called Rank and then I have dates, first column under Rank contains the rank number (the first column is used as index).

**df1** UNIQUE CODES

Rank 12/8/2017 12/9/2017 .... 1/3/2018
1     Code_1      Code_3       Code_4
2     Code_2      Code_1       Code_2
...
1000  Code_5      Code_6       Code_7

**df2** NAMES

Rank 12/8/2017 12/9/2017 .... 1/3/2018
1     Jon        Maria         Peter
2     Brian      Jon           Maria
...
1000  Chris      Tim           Charles

**df3** SCORES

Rank 12/8/2017 12/9/2017 .... 1/3/2018
1     10           20           30
2     15           10           40
...
1000  25           15           20

Desired output:

I want to combine these dataframes into a dictionary, using df1 codenames as keys, so it will look like this:

dictionary = {'Code_1':[Jon, 20] , 'Code_2':[Brian, 15]}

As there are repeat competitors, I will need to sum their scores during all of the data series. So in the above examples, the Score_1 of Jon will contain scores for 12/8/2017 and 12/9/2017.

There are 1000 rows and 26 columns + index, so need a way to capture those. I think that a nested loop could work here, but don't have enough experience to build one that works.

In the end, I would like to sort the dictionary by highest score. Please suggest any solutions to this or more straightforward ways to combine this data and get the score ranking.

I attached pictures of dataframes, containing names, codes, and scores.

names

codes

scores

I used the proposed solution below on the 3 dataframes that I have. Please note that hashtags stands for code, players for names, and trophies for scores:

# reshape to get dates into rows
hashtags_reshaped = pd.melt(hashtags, id_vars = ['Rank'], 
                   value_vars = hashtags.columns, 
                   var_name = 'Date', 
                   value_name = 'Code').drop('Rank', axis = 1)

# reshape to get dates into rows
players_reshaped = pd.melt(players, id_vars = ['Rank'], 
                   value_vars = hashtags.columns, 
                   var_name = 'Date', 
                   value_name = 'Name').drop('Rank', axis = 1)

# reshape to get the dates into rows
trophies_reshaped = pd.melt(trophies, id_vars = ['Rank'], 
                   value_vars = hashtags.columns, 
                   var_name = 'Date', 
                   value_name = 'Score').drop('Rank', axis = 1)

# merge the three together. 
# This _assumes_ that the dfs are all in the same order and that all the data matches up.
merged_df = pd.DataFrame([hashtags_reshaped['Date'], 
hashtags_reshaped['Code'], players_reshaped['Name'], 
trophies_reshaped['Score']]).T
print(merged_df)

# group by code, name, and date; sum the scores together if multiple exist for a given code-name-date grouping
grouped_df = merged_df.groupby(['Code', 'Name', 'Date']).sum().sort_values('Score', ascending = False)
print(grouped_df)

summed_df = merged_df.drop('Date', axis = 1) \
.groupby(['Code', 'Name']).sum() \
.sort_values('Score', ascending = False).reset_index()
summed_df['li'] = list(zip(summed_df.Name, summed_df.Score))
print(summed_df)

But I'm getting a strange output: the summed scores should be in hundreds or low thousands (as an average score is 200-300 and an average participation frequency is 4-6 times). The score results I'm getting are way off, but their match codes and names correctly.

summed_df:

0       (MandiBralaX, 996871590076253)  
1              (Arso_C, 9955130513430)  
2               (ThatRainbowGuy, 9946)  
3                         (fabi, 9940)  
4                      (Dogão, 991917)  
5                      (Hierbo, 99168)  
6               (Clyde, 9916156180128)  
7      (.A.R.M.I.N., 9916014310187143)  
8                (keftedokofths, 9900)  
9                   (⚽AngelSosa⚽, 990)  
10                       (Totoo98, 99)  

group_df:

          Code             Name             Score  \
0       #JL2J02LY      MandiBralaX   996871590076253   
1       #80JQ90VC           Arso_C     9955130513430   
2       #9GGC2CUQ   ThatRainbowGuy              9946   
3       #8LL989QV             fabi              9940   
4        #9PPC89L            Dogão            991917   
5      #2JPLQ8JP8           Hierbo             99168

Solution

  • This should get you much of the way there. I didn't create a dictionary at the end as you specified; while you may need that format, you'd end up with nested dictionaries or lists, as each Code has 1 Name but possibly many Dates and Scores associated with it. How do you want those recorded - list, dict, etc?

    The code below returns a grouped dataframe; you can output it directly to a dict (shown), but you'll probably want to specify the format in detail, especially if you need an ordered dictionary. (Dictionaries are inherently not ordered; you'll have to from collections import OrderedDict and review that documentation if you really need an ordered dictionary.

    import pandas as pd
    
    #create the dfs; note that 'Code' is set up as a string
    df1 = pd.DataFrame({'Rank': [1, 2], '12/8/2017': ['1', '2'], '12/9/2017': ['3', '1']})
    df1.set_index('Rank', inplace = True)
    
    # reshape to get dates into rows
    df1_reshaped = pd.melt(df1, id_vars = ['Rank'], 
                           value_vars = df1.columns, 
                           var_name = 'Date', 
                           value_name = 'Code').drop('Rank', axis = 1)
    #print(df1_reshaped)
    
    # create the second df
    df2 = pd.DataFrame({'Rank': [1, 2], '12/8/2017': ['Name_1', 'Name_2'], '12/9/2017': ['Name_3', 'Name_1']})
    df2.set_index('Rank', inplace = True)
    
    # reshape to get dates into rows
    df2_reshaped = pd.melt(df2, id_vars = ['Rank'], 
                           value_vars = df1.columns, 
                           var_name = 'Date', 
                           value_name = 'Name').drop('Rank', axis = 1)
    #print(df2_reshaped)
    
    # create the third df
    df3 = pd.DataFrame({'Rank': [1, 2], '12/8/2017': ['10', '20'], '12/9/2017': ['30', '10']})
    df3.set_index('Rank', inplace = True)
    
    # reshape to get the dates into rows
    df3_reshaped = pd.melt(df3, id_vars = ['Rank'], 
                           value_vars = df1.columns, 
                           var_name = 'Date', 
                           value_name = 'Score').drop('Rank', axis = 1)
    #print(df3_reshaped)
    
    # merge the three together. 
    # This _assumes_ that the dfs are all in the same order and that all the data matches up.
    merged_df = pd.DataFrame([df1_reshaped['Date'], df1_reshaped['Code'], df2_reshaped['Name'], df3_reshaped['Score']]).T
    print(merged_df)
    
    # group by code, name, and date; sum the scores together if multiple exist for a given code-name-date grouping
    grouped_df = merged_df.groupby(['Code', 'Name', 'Date']).sum().sort_values('Score', ascending = False)
    print(grouped_df)
    
    summed_df = merged_df.drop('Date', axis = 1) \
        .groupby(['Code', 'Name']).sum() \
        .sort_values('Score', ascending = False).reset_index()
    summed_df['li'] = list(zip(summed_df.Name, summed_df.Score))
    print(summed_df)
    

    Unsorted dict:

    d = dict(zip(summed_df.Code, summed_df.li))
    print(d)
    

    You can make the OrderedDict directly, of course, and should:

    from collections import OrderedDict
    d2 = OrderedDict(zip(summed_df.Code, summed_df.li))
    print(d2)
    

    summed_df:

      Code    Name  Score            li
    0    3  Name_3     30  (Name_3, 30)
    1    1  Name_1     20  (Name_1, 20)
    2    2  Name_2     20  (Name_2, 20)
    

    d:

    {'3': ('Name_3', 30), '1': ('Name_1', 20), '2': ('Name_2', 20)}
    

    d2, sorted:

    OrderedDict([('3', ('Name_3', 30)), ('1', ('Name_1', 20)), ('2', ('Name_2', 20))])
    

    This returns your (name, score) as a tuple, not a list, but... it should get more of the way there.