Search code examples
pythonpandasdataframejoinpivot-table

Pd pivot table lambda function to join column values with exceptions not working


I am currently working with a dataframe looking at Kentucky oil wells with pandas and want to create a pivot table using an API identifier. Since ther are various duplicates, I also wanted to join non unique values.

Below is an example of the dataframe:

import pandas as pd

df = pd.DataFrame({'API': ['16101030580000', '16101030580000', '16129057600000','16013006300000'], 
                   'Date': ['0000/00/00','6/15/2007', '5/25/2020', '7/31/2014'],
                   'Annual_Oil':[300,'nan',150, 360],
                   'State':['KY','None', 'None', 'KY']})

Additionally, I created a list of values that I did not want to join. However, when runnning the code, I get some values in the dataframe that should not be there.

list_none = ['none', 'nan', 'NAN','None', '0000/00/00','000']
df1 = pd.pivot_table(df, index = 'API',
                        aggfunc = lambda x: (','.join(x.unique().astype(str)) if x not in list_none else x),
                        sort = False)

The output for this example dataframe looks like

                Date                    Annual_Oil State
API         
16101030580000  0000/00/00,6/15/2007    300,nan    KY,None
16129057600000  5/25/2020               150        None
16013006300000  7/31/2014               360        KY

Is there a way to restructure the lambda function within the pivot table or would I have to get rid of the unwanted joins manually?


Solution

  • You should filter the values within the join:

    list_none = ['none', 'nan', 'NAN', 'None', '0000/00/00', '000']
    df1 = pd.pivot_table(
        df,
        index='API',
        aggfunc=lambda x: ','.join(
            i for i in x.unique().astype(str) if i not in list_none
        ),
        sort=False,
    )
    

    Alternative using a custom function:

    def cust_join(x):
        x = x.dropna().astype(str)
        return ','.join(x[~x.isin(list_none)].unique())
    
    df1 = pd.pivot_table(
        df,
        index='API',
        aggfunc=cust_join,
        sort=False,
    )
    

    Output:

                         Date Annual_Oil State
    API                                       
    16101030580000  6/15/2007        300    KY
    16129057600000  5/25/2020        150      
    16013006300000  7/31/2014        360    KY