Series not being passed to a UDF in .transform() in pandas


From what I've read in various answers on Stack Overflow and in other resources, when .transform() is given a UDF, each column of each group is passed to it one at a time, as a Series.

But when I tried it myself, I saw a DataFrame being passed to the UDF as well:

import pandas as pd

df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'], 
                   'a':[4,5,1,3], 'b':[6,10,3,11]})

def inspect(x):
    # print the type of the object that transform passes in
    print(type(x))

df.groupby('State').transform(inspect)

# Output 
# <class 'pandas.core.series.Series'>
# <class 'pandas.core.series.Series'>
# <class 'pandas.core.frame.DataFrame'>
# <class 'pandas.core.series.Series'>
# <class 'pandas.core.series.Series'>

The DataFrame passed to inspect happens to be the DataFrame of the first group (State = Florida). But I haven't seen anyone mention a DataFrame being passed to a UDF when using .transform().

My questions are:

  • Why is a DataFrame passed to inspect when everyone says that a Series (one per column) is passed to the UDF?
  • Why was the DataFrame of the first group passed to inspect, and not that of the second group?

Solution

  • According to the groupby.transform documentation (see the second requirement in particular):

    The current implementation imposes three requirements on f:

    • f must return a value that either has the same shape as the input subframe or can be broadcast to the shape of the input subframe. For example, if f returns a scalar it will be broadcast to have the same shape as the input subframe.
    • if this is a DataFrame, f must support application column-by-column in the subframe. If f also supports application to the entire subframe, then a fast path is used starting from the second chunk.
    • f must not mutate groups. Mutation is not supported and may produce unexpected results. See Mutating with User Defined Function (UDF) methods for more details.

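    I therefore believe that transform is performing this check: after processing the first group column by column, it calls f once on the whole subframe to test whether the fast path can be used. If we track the order of the calls with a counter, the numbers are consecutive, except for a gap right after the first group: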

    from itertools import count
    import pandas as pd
    
    df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida', 'Washington'], 
                       'a':[4,5,1,3,2], 'b':[6,10,3,11,12]})
    
    c = count()
    def inspect(x):
        # ignore the input and return the index of the current call
        return next(c)
    
    df.groupby('State').transform(inspect)
    

    Output. Notice that the value 2 is missing; it was most likely consumed by the extra call on the whole DataFrame during the check:

       a  b
    0  3  4  # second group (3 and 4)
    1  3  4
    2  0  1  # first group (0 and 1)
    3  0  1
    4  5  6  # third group (5 and 6)
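
As a rough illustration of the fast path mentioned in the quoted documentation, the sketch below records the type of every object passed to a UDF that works both on a single column and on a whole subframe. The names `seen` and `record` are made up for this example, and the exact sequence of calls is an implementation detail that may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

seen = []  # hypothetical helper list: type names in call order

def record(x):
    # Identity UDF: valid for a single column and for a whole subframe,
    # so transform is free to pick either path.
    seen.append(type(x).__name__)
    return x

result = df.groupby('State').transform(record)

print(seen)    # order of Series / DataFrame calls (version dependent)
print(result)  # columns a and b, values unchanged
```

If the fast path is selected, the groups after the first one should appear in `seen` as a single DataFrame call each, rather than as one Series call per column.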