How to chain operations in pandas entirely in-line?


I often want to both manipulate and display a dataframe during a sequence of chained operations, for which I would use*:

df = (
  df

  #Modify the dataframe:
  .assign(new_column=...)

  #View result (without killing the chain)
  .pipe(lambda df_: display(df_) or df_)

  #...further chaining is possible
)

The code block above adds new_column to the dataframe, displays the new dataframe, and finally returns it. Chaining works here because display returns None, which is falsy, so display(df_) or df_ evaluates to df_.

My question is about scenarios where I want to replace display with plt.plot or some other function that returns a truthy value. In such cases, the or expression would return that value instead, and df_ would no longer propagate through the chain.
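To make the difference concrete, here is a minimal sketch using two stand-in functions. The names show and plot_like are hypothetical (they are not pandas or matplotlib APIs): one returns None like display, the other returns a truthy object like a plotting call would.

```python
import pandas as pd

def show(df_):
    # Stand-in for display: side effect only, returns None (falsy)
    print(df_.shape)

def plot_like(df_):
    # Stand-in for a plotting call: returns a truthy object
    return ["line"]

df = pd.DataFrame({"a": [1, 2, 3]})

# None or df_ evaluates to df_, so the dataframe survives
kept = df.pipe(lambda df_: show(df_) or df_)

# A truthy return value short-circuits the `or`, so df_ is dropped
lost = df.pipe(lambda df_: plot_like(df_) or df_)

print(type(kept).__name__)  # DataFrame
print(type(lost).__name__)  # list
```

Any further chained method would then be called on the list rather than on the dataframe.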

Currently, my workaround is to define an external function, transparent_pipe, that can run plt.plot or any other function(s) while also ensuring that the dataframe gets propagated:

def transparent_pipe(df, *funcs):
  #Call each function for its side effect only, then hand the dataframe back
  for func in funcs:
    func(df)
  return df

df = (
  df

  #Modify the dataframe:
  .assign(new_column=...)

  #Visualise a column from the modified df, without killing the chain
  #(transparent_pipe expects callables, so the plot call is wrapped in a lambda)
  .pipe(lambda df_: transparent_pipe(df_, lambda d: plt.ecdf(d.new_column), display, ...))

  #...further chaining is possible
)
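Since transparent_pipe calls each of its arguments with the dataframe, it has to be handed callables rather than the results of calls. A minimal runnable sketch of that usage follows; the calls list and the two lambdas are stand-ins introduced here for illustration, so that no plotting backend or notebook is needed.

```python
import pandas as pd

def transparent_pipe(df, *funcs):
    # Call each function for its side effect, ignore its return value
    for func in funcs:
        func(df)
    return df

calls = []  # records the side effects of the stand-in functions

result = (
    pd.DataFrame({"x": [1, 2, 3]})
    .assign(new_column=lambda d: d.x * 2)
    # pipe forwards extra positional arguments, so no outer lambda is needed
    .pipe(
        transparent_pipe,
        # stand-in for a plot call: has a side effect and returns a truthy value
        lambda d: calls.append(("plot", int(d.new_column.sum()))) or "truthy",
        # stand-in for display: has a side effect and returns None
        lambda d: calls.append(("show", d.shape)),
    )
    .mul(10)
)
```

Both stand-ins run once, in order, and the chain continues with the dataframe regardless of what they return.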

Question

Is there an entirely in-line way of doing this, without needing to define transparent_pipe?

Preferably just using pandas.


*Tip from Effective Pandas 2: Opinionated Patterns for Data Manipulation, M. Harrison, 2024.


Solution

  • With pyjanitor, you could use the also method:

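    # pip install pyjanitor
    import pandas as pd
    import janitor  # importing janitor registers the also method on DataFrame
    from IPython.display import display  # already available in Jupyter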
    
    df = (pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
            .also(display)
            .mul(10)
         )
    

    Alternatively, use a wrapper function that discards the output of any function and returns its first argument (the DataFrame) instead:

    def hide(f):
        """The inner function should accept the DataFrame as first parameter"""
        def inner(df, *args, **kwargs):
            f(df, *args, **kwargs)
            return df
        return inner
    
    df = (pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
            .pipe(hide(display))
            .mul(10)
         )
    

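    Or, staying close to the original approach, with short-circuiting (and False makes the left-hand side falsy even when the function returns a truthy value, so or x yields x):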

    df = (pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
            .pipe(lambda x: plt.ecdf(x['col1']) and False or x) # truthy output
            .pipe(lambda x: display(x['col1']) and False or x)  # falsy output
            .mul(10)
         )
    

    Or forcing a truthy value with a non-empty tuple, so that and x always yields x:

    df = (pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
            # example 1
            .pipe(lambda x: (display(x),) and x)
            # example 2
            .pipe(lambda x: (display(x), plt.ecdf(x['col1'])) and x)
            .mul(10)
         )
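One practical difference between the last two idioms shows up when the side-effect function itself returns a DataFrame or Series. The sketch below assumes a hypothetical summarise function for illustration: the tuple form only tests the truthiness of the tuple, whereas the and False or x form has to evaluate the truthiness of the returned object, which pandas refuses to do for a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

def summarise(x):
    # Hypothetical side-effect function that also returns a DataFrame
    return x.describe()

# Tuple form: a non-empty tuple is truthy whatever it contains
out = df.pipe(lambda x: (summarise(x),) and x).mul(10)

# `and False or x` form: needs bool(DataFrame), which raises ValueError
try:
    df.pipe(lambda x: summarise(x) and False or x)
    failed = False
except ValueError:
    failed = True

print(failed)
```

For functions with an arbitrary return type, the tuple form (or a wrapper such as hide) is therefore the safer choice.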