python pandas performance dictionary structure

Efficient way to rebuild a dictionary of dataframes

I have a dictionary that is filled with multiple dataframes. Now I am searching for an efficient way for changing the key structure, but the solution I have found is rather slow when more dataframes / bigger dataframes are involved. Thats why I wanted to ask if anyone might know a more convenient / efficient / faster way or approach than mine. So first, I created this example to show where I initially started:

import pandas as pd
import numpy as np

# assign keys to dic
teams = ["Arsenal", "Chelsea", "Manchester United"]
dic_teams = {}

# fill dic with random entries
for t1 in teams:

    dic_teams[t1] = pd.DataFrame({'date': pd.date_range("20180101", periods=30), 
                                  'Goals': pd.Series(np.random.randint(0,5, size = 30)),
                                  'Chances': pd.Series(np.random.randint(0,15, size = 30)),
                                  'Fouls': pd.Series(np.random.randint(0, 20, size = 30)),
                                  'Offside': pd.Series(np.random.randint(0, 10, size = 30))})

    dic_teams[t1] = dic_teams[t1].set_index('date')
    dic_teams[t1].index.name = None

Now I basically have a dictionary where every key is a team, which means I have a dataframe for every team with information on their game performance over time. Now I would prefer to change this particular dictionary so I get a structure where the key is the date, instead of a team. This would mean that I have a dataframe for every date, which is filled with the performance of each team on that date. I managed to do that using the following code, which works but is really slow once I add more teams and performance factors:

# prepare lists for looping
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}

# new structure where key = date
for d in dates:
    dic_dates[d] = pd.DataFrame(index = teams, columns = perf)

    for t2 in teams:
        dic_dates[d].loc[t2] = dic_teams[t2].loc[d]

Because I am using a nested loop, the restructuring of my dictionary is slow. Does anyone have an idea how I could improve the second piece of code? I'm not necessarily searching just for a solution, also for a logic or idea how to do better.

Thanks in advance, any help is highly appreciated

Solution

Creating a Pandas dataframes the way you do is (strangely) awfully slow, as well as direct indexing.

Copying a dataframe is surprisingly quite fast. Thus you can use an empty reference dataframe copied multiple times. Here is the code:

dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
zygote = pd.DataFrame(index = teams, columns = perf)
dic_dates = {}

# new structure where key = date
for d in dates:
    dic_dates[d] = zygote.copy()

    for t2 in teams:
        dic_dates[d].loc[t2] = dic_teams[t2].loc[d]

This is about 2 times faster than the reference on my machine.

Overcoming the slow dataframe direct indexing is tricky. We can use numpy to do that. Indeed, we can convert the dataframe to a 3D numpy array, use numpy to perform the transposition, and finally convert the slices into dataframes again. Note that this approach assumes that all values are integers and that the input dataframe are well structured.

Here is the final implementation:

dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}

# Create a numpy array from Pandas dataframes
# Assume the order of the `dates` and `perf` indices are the same in all dataframe (and their order)
full = np.empty(shape=(len(teams), len(dates), len(perf)), dtype=int)
for tId,tName in enumerate(teams):
    full[tId,:,:] = dic_teams[tName].to_numpy()

# New structure where key = date, created from the numpy array
for dId,dName in enumerate(dates):
    dic_dates[dName] = pd.DataFrame({pName: full[:,dId,pId] for pId,pName in enumerate(perf)}, index = teams)

This implementation is 6.4 times faster than the reference on my machine. Note that about 75% of the time is sadly spent in the pd.DataFrame calls. Thus, if you want a faster code, use a basic 3D numpy array!