Tags: python, pandas, loops, dictionary, data-analysis

Better pattern for storing results in loop?


When I work with data, I often have a bunch of similar objects that I want to iterate over, process, and store the results for.

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randint(0, 1000, 20))
df2 = pd.DataFrame(np.random.randint(0, 1000, 20))

results = []
for df in [df1, df2]:
    tmp_result = df.median()    # do some processing
    results.append(tmp_result)  # append results

The problem I have with this is that it's not clear which dataframe each result corresponds to. I thought of using the objects themselves as keys for a dict, but this won't always work, because some objects, dataframes included, are not hashable and so can't be used as dict keys:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randint(0, 1000, 20))
df2 = pd.DataFrame(np.random.randint(0, 1000, 20))

results = {}
for df in [df1, df2]:
    tmp_result = df.median()    # do some processing
    results[df] = tmp_result    # doesn't work: raises TypeError (DataFrame is unhashable)

I can think of a few hacks to get around this, like defining unique keys for the input objects before the loop, or storing the input and the result as a tuple in the results list. But in my experience those approaches are rather unwieldy and error-prone, and I suspect they're not terribly great for memory usage either. Mostly, I just end up using the first example and take care to manually keep track of the position of each result.
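
For concreteness, the first workaround mentioned above (defining unique keys for the inputs before the loop) might look like the sketch below. The labels "df1" and "df2" and the name `frames` are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical labels: any hashable, unique key (str, int, tuple) would do.
frames = {
    "df1": pd.DataFrame(np.random.randint(0, 1000, 20)),
    "df2": pd.DataFrame(np.random.randint(0, 1000, 20)),
}

results = {}
for name, df in frames.items():
    # Store each result under the same key as its input dataframe.
    results[name] = df.median()
```

Each result can then be looked up by label, for example `results["df1"]`. The dict holds references to the dataframes rather than copies, so the labelling itself does not duplicate the data.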

Are there any obvious solutions or best practices for this problem?


Solution

  • You can keep the original dataframe and the result together in a class:

    class Whatever:
        def __init__(self, df):
            self.df = df
            self.result = None
    
    whatever1 = Whatever(pd.DataFrame(...))
    whatever2 = Whatever(pd.DataFrame(...))
    
    for whatever in [whatever1, whatever2]:
        whatever.result = whatever.df.median()
    

    There are many ways to improve this depending on your situation: generate the result right in the constructor, add a method to generate and store it, compute it on the fly from a property, and so on.
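
As one possible illustration of the "compute it from a property" variant, here is a minimal sketch. It assumes Python 3.8+ for `functools.cached_property`; the `name` attribute and the labels are hypothetical additions, not part of the class above.

```python
from functools import cached_property

import numpy as np
import pandas as pd


class Whatever:
    def __init__(self, name, df):
        self.name = name  # hypothetical label, used only to identify the input
        self.df = df

    @cached_property
    def result(self):
        # Computed on first access, then cached on the instance.
        return self.df.median()


items = [
    Whatever("df1", pd.DataFrame(np.random.randint(0, 1000, 20))),
    Whatever("df2", pd.DataFrame(np.random.randint(0, 1000, 20))),
]

for item in items:
    # The input and its result travel together on the same object.
    print(item.name, item.result.iloc[0])
```

With this shape there is no separate results container to keep in sync. If the dataframe is modified after the first access, the cached result becomes stale, so a plain `@property` may be the safer choice when the inputs change.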