From what I've read in different answers on Stack Overflow and other resources, when .transform() is given a UDF, each column is passed to it one by one for each group.
But when I tried it myself, I also saw a DataFrame being passed into the UDF:
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

def inspect(x):
    # print the type of whatever object transform hands to the UDF
    print(type(x))

df.groupby('State').transform(inspect)
# Output
# <class 'pandas.core.series.Series'>
# <class 'pandas.core.series.Series'>
# <class 'pandas.core.frame.DataFrame'>
# <class 'pandas.core.series.Series'>
# <class 'pandas.core.series.Series'>
The DataFrame passed to inspect happens to be the DataFrame of the first group (State = Florida). But no one seems to mention a DataFrame being passed when working with UDFs in .transform().
My questions are:
- Why is a DataFrame passed to the inspect function when everyone says a Series (each column) is passed to the UDF?
- Why was only the first group passed as a DataFrame to inspect? Why wasn't the second group passed to inspect?

According to the groupby.transform documentation (see the second requirement):
The current implementation imposes three requirements on f:
- f must return a value that either has the same shape as the input subframe or can be broadcast to the shape of the input subframe. For example, if f returns a scalar it will be broadcast to have the same shape as the input subframe.
- if this is a DataFrame, f must support application column-by-column in the subframe. If f also supports application to the entire subframe, then a fast path is used starting from the second chunk.
- f must not mutate groups. Mutation is not supported and may produce unexpected results. See Mutating with User Defined Function (UDF) methods for more details.
I thus believe that transform is performing this check: on the first group it also calls f once on the entire subframe to see whether the fast path can be used. Indeed, if we track the order of the calls with a counter, the numbers are consecutive except for a gap right after the first group:
import pandas as pd
from itertools import count

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida', 'Washington'],
                   'a': [4, 5, 1, 3, 2], 'b': [6, 10, 3, 11, 12]})

c = count()

def inspect(x):
    # ignore the input and return the running call number
    x = next(c)
    return x

df.groupby('State').transform(inspect)
Output: notice that step 2 is missing, likely consumed when the check for a DataFrame happens:
   a  b
0  3  4  # second group (3 and 4)
1  3  4
2  0  1  # first group (0 and 1)
3  0  1
4  5  6  # third group (5 and 6)
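
To make the reasoning concrete, here is a small pandas-free toy model of the check I think is happening. It is only a sketch under an assumption: that transform probes f once on the whole first subframe and keeps the column-by-column path unless that probe returns a frame-like result matching the column-by-column one. The names choose_path and transform, and the dict-of-lists stand-in for a sub-DataFrame, are hypothetical and not pandas internals.

```python
from itertools import count


def choose_path(func, group):
    """Probe both call styles on the first group (toy model, not pandas code).

    `group` is a dict of column name -> list of values standing in for a
    sub-DataFrame; each list stands in for a Series.
    """
    # Slow path: one call per column.
    slow = {col: func(values) for col, values in group.items()}
    try:
        # Probe: one extra call with the whole group.
        fast = func(group)
    except Exception:
        return "slow", slow
    # Assumption: the fast path is adopted only when the probe returns a
    # frame-like result equal to the column-by-column result.
    if isinstance(fast, dict) and fast == slow:
        return "fast", slow
    return "slow", slow


def transform(groups, func):
    """Apply func to every group, choosing the path on the first group."""
    path, results = None, {}
    for name, group in groups.items():
        if path is None:
            path, results[name] = choose_path(func, group)
        elif path == "fast":
            results[name] = func(group)
        else:
            results[name] = {col: func(v) for col, v in group.items()}
    return path, results


# Groups in sorted key order, as groupby iterates them by default.
groups = {
    "Florida": {"a": [1, 3], "b": [3, 11]},
    "Texas": {"a": [4, 5], "b": [6, 10]},
    "Washington": {"a": [2], "b": [12]},
}

# UDF 1: the counter from above. The probe returns a scalar, so it is
# rejected, but it still consumes one number (the missing step 2).
c = count()
counter_path, counter_results = transform(groups, lambda x: next(c))
print(counter_path)              # slow
print(counter_results["Texas"])  # {'a': 3, 'b': 4}

# UDF 2: works on a column and on a whole group, so the probe matches.
seen = []


def demean_column(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def demean(x):
    seen.append("frame" if isinstance(x, dict) else "column")
    if isinstance(x, dict):
        return {col: demean_column(v) for col, v in x.items()}
    return demean_column(x)


demean_path, demean_results = transform(groups, demean)
print(demean_path)  # fast
print(seen)         # ['column', 'column', 'frame', 'frame', 'frame']
```

In this model the first UDF never gets past the probe, which would explain why the question only ever sees one DataFrame: inspect returns nothing frame-like, so every later group goes column by column again. Unlike pandas, the model does not broadcast scalar results to the group's shape; it only tracks which calls are made and in what order.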