Tags: python, dataframe, apply

How do I properly call a function and return an updated dataframe?


I am trying to process and update rows in a dataframe through a function and get the updated dataframe back so I can keep working with it. When I return a dataframe from the function, the original call gives me a Series rather than the expected column updates. A simple example is below:

import pandas as pd

df = pd.DataFrame(['adam', 'ed', 'dra', 'dave', 'sed', 'mike'],
                  index=['a', 'b', 'c', 'd', 'e', 'f'], columns=['A'])

def get_item(data):
    comb = pd.DataFrame()
    comb['Newfield'] = data     # create new columns
    comb['AnotherNewfield'] = 'y'

    return pd.DataFrame(comb)

Calling the function using apply:

>>> newdf = df['A'].apply(get_item)

>>> newdf
a          A   Newfield AnotherNewfield
a  adam  st...
b          A   Newfield AnotherNewfield
e   sed  st...
c          A   Newfield AnotherNewfield
d  dave  st...
d          A   Newfield AnotherNewfield
d  dave  st...
e          A   Newfield AnotherNewfield
s   NaN  st...
f         A   Newfield AnotherNewfield
m  NaN  str(...
Name: A, dtype: object
>>> type(newdf)
<class 'pandas.core.series.Series'>

I assume that apply() is bad here, but am not quite sure how I 'should' be updating this dataframe via function otherwise.

Edit: I apologize, but it seems I accidentally deleted the sample function in an earlier edit. I have added it back here while I try a few other things I found in other posts.

Testing a slightly different approach, with individual variables and multiple Series returned from the function, seems to work, so I will see whether I can do this in my actual case and update.

def get_item(data):

    value = data     #create new columns
    AnotherNewfield = 'y'
    return pd.Series(value),pd.Series(AnotherNewfield)
df['B'], df['C'] = zip(*df['A'].apply(get_item))

Solution

  • You could use groupby with apply to get a DataFrame back from the apply call, like this:

    import pandas as pd
    
    # add a constant column B to group on - a single group is all we need for the trick
    df = pd.DataFrame(
        {'A':['adam', 'ed', 'dra','dave','sed','mike'], 'B': [1,1,1,1,1,1]},
        index=['a', 'b', 'c', 'd', 'e', 'f'])
    
    def get_item(data):
        # data is the sub-DataFrame for one group (here: the whole frame)
        comb = pd.DataFrame()
        # copy column A with a fresh 0..n-1 index
        # (Series.append, used in older code for this, was removed in pandas 2.0)
        comb['Newfield'] = data['A'].reset_index(drop=True)
        comb['AnotherNewfield'] = 'y'
        # return complete dataframe
        return comb
    
    # group on column B so that apply passes get_item a DataFrame per group instead of a single value
    newdf = df.groupby('B').apply(get_item)
    # the result has a MultiIndex - drop level 0 (the group key B with value 1 added by groupby) and keep the result
    newdf = newdf.droplevel(0)
    

    Output:

        Newfield    AnotherNewfield
    0   adam        y
    1   ed          y
    2   dra         y
    3   dave        y
    4   sed         y
    5   mike        y
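
For comparison, here is a minimal sketch of an alternative that avoids the dummy group column. It is not part of the answer above; it relies on the documented behaviour that Series.apply returns a DataFrame when the applied function returns a Series. The names new_cols and result are just illustrative, and the sketch assumes each new field can be computed from a single value of column A.

```python
import pandas as pd

df = pd.DataFrame(
    ['adam', 'ed', 'dra', 'dave', 'sed', 'mike'],
    index=['a', 'b', 'c', 'd', 'e', 'f'],
    columns=['A'],
)

def get_item(value):
    # value is a single cell of column A; return one labelled row of new fields
    return pd.Series({'Newfield': value, 'AnotherNewfield': 'y'})

# Series.apply builds a DataFrame when the function returns a Series
new_cols = df['A'].apply(get_item)

# attach the new columns to the original frame; the a..f index is kept
result = df.join(new_cols)
```

Unlike the groupby version, this keeps the original a..f index, so the new columns line up with the existing rows and can be joined straight back onto df. For large frames, a row-by-row apply like this may be slow compared with assigning whole columns directly.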