Search code examples
pythonpandasloops

For loop in Pandas allows applying some changes to the df whereas discarding others


When having multiple df's to which similar changes need to be made I usually tend to use a for loop as a quick solution to avoid repetitive work.

I understand that the df change happening in the loop is being applied to a copy. So for some methods I can overcome that with the 'inplace' parameter where available. Unfortunately for things like filtering I don't seem to have such an option. Below are examples of a code where I try to filter on rows where the date is higher or equal then the variable 'now'. Regarding the same filtering outside of the loop it will work correctly so it's not the matter of incorrect dtype or lack of data to be filtered out. Below are my code examples and would like to learn if the incorrect result is due to my poor knowledge or there are specific limitations and if so how do more experience users tackle this:

original code:

now = pd.to_datetime(date.today(), format='%Y%m%d')

for df in df_list:
    df.dropna(how='all', axis=1, inplace=True)
    df['Valid From Converted'] = pd.to_datetime(df['Valid From'], format='%Y%m%d')
    df['to Converted'] = pd.to_datetime(df['to'], format='%Y%m%d')
    df = df[df['to Converted'] >= now]

enumerate approach:

now = pd.to_datetime(date.today(), format='%Y%m%d')

for i, df in enumerate(df_list):
    df.dropna(how='all', axis=1, inplace=True)
    df['Valid From Converted'] = pd.to_datetime(df['Valid From'], format='%Y%m%d')
    df['to Converted'] = pd.to_datetime(df['to'], format='%Y%m%d')
    df = df[df['to Converted'] >= now]
    df_list[i] = df

range approach:

for i in range(len(df_list)):
    df_list[i].dropna(how='all', axis=1, inplace=True)
    df_list[i]['Valid From Converted'] = pd.to_datetime(df_list[i]['Valid From'], format='%Y%m%d')
    df_list[i]['to Converted'] = pd.to_datetime(df_list[i]['to'], format='%Y%m%d')
    df_list[i] = df_list[i][df_list[i]['to Converted'] >= now]

Solution

  • For code clarity, durability, and maintainability, avoid relying on side-effects but explicitly assign objects. One approach can be to define a method to handle the multiple step operations and call it in a list comprehension:

    def proc_frame(df):
        df = df.dropna(how='all', axis=1)
        df['Valid From Converted'] = pd.to_datetime(df['Valid From'], format='%Y%m%d')
        df['to Converted'] = pd.to_datetime(df['to'], format='%Y%m%d')
        return df[df['to Converted'] >= now]
    
    new_df_list = [proc_frame(df) for df in df_list]
    

    Additionally, consider chaining lines in right hand side methods like assign and query to avoid the df calls.

    def proc_frame(df):
        return (
            df.dropna(how='all', axis=1)
              .assign(**{
                  'Valid From Converted': lambda x: pd.to_datetime(x['Valid From'], format='%Y%m%d'),
                  'to Converted': lambda x: pd.to_datetime(x['to'], format='%Y%m%d')
              })
              .query('`to Converted` >= @now')
        )
    new_df_list = [proc_frame(df) for df in df_list]