pandas, group-by, data-analysis

I have selected two columns in a groupby object. How do I apply a true or false filter on one and then apply a function on the other?


Say I have the Airbnb dataset with a bunch of columns. Of interest are 'neighbourhood_cleansed', 'host_is_superhost' and 'price'. I wish to find the neighbourhood in which the difference between the median prices of superhosts and non-superhosts is largest.

I want to know if this can be done entirely using pandas functions.

My logic is to group by 'neighbourhood_cleansed' at first, then filter the groupby object into superhosts and non-superhosts, and then use the median function.

I have defined a function func:

import pandas as pd

def func(host_is_superhost, price):
    superhost_prices = price[host_is_superhost == 't']
    notsuperhost_prices = price[host_is_superhost == 'f']
    return (superhost_prices.median() - notsuperhost_prices.median())

listings = pd.read_csv("https://storage.googleapis.com/public-data-337819/listings%202%20reduced.csv", low_memory=False)
neighbourhoods = listings.groupby('neighbourhood_cleansed')[['host_is_superhost', 'price']]

When I run the following:

neighbourhoods.apply(func)

The error thrown is

TypeError: func() missing 1 required positional argument: 'price'

How do I solve this?

Do y'all have better ways of solving the initial question?


Solution

  • Your original func expects two arguments, but apply passes it only one, which is why you got the error message.

    To understand what is going on here when using apply, try this example:

    We can see that the 'source' column only has two values.

    listings["source"].unique()
    
    Out: array(['city scrape', 'previous scrape'], dtype=object)
    

    Let's try a simpler version of func with a groupby on 'source':

    def func2(group):
        print(type(group))
        print(group.shape)
        
        
    grpby = listings.groupby("source")[["host_is_superhost", "price"]]
    grpby.apply(func2)
    

    Prints out:

    <class 'pandas.core.frame.DataFrame'>
    (55934, 2)
    <class 'pandas.core.frame.DataFrame'>
    (32012, 2)
    

    This shows that when apply is used, func2 is called once per group and is passed a single pd.DataFrame containing that group's rows for the selected columns, so the number of rows varies from group to group.

    An alternate approach that should accomplish what you want is to reshape the data with pd.pivot_table and calculate the median price for each neighbourhood and superhost status. (Note that 'price' is stored as text, not as numbers, so it needs to be cleaned first.) For example:

    # Strip "$" and thousands separators, then convert the strings to floats.
    listings["price_cleaned"] = (
        listings["price"].str.replace(r"[$,]", "", regex=True).astype(float)
    )
    pt = pd.pivot_table(
        listings,
        values="price_cleaned",
        index="neighbourhood_cleansed",
        columns="host_is_superhost",
        aggfunc="median",
    )
    pt["diff"] = pt["t"] - pt["f"]
    mask = pt["diff"] == pt["diff"].max()
    print(pt.index[mask][0])  # there is only one neighborhood in this case
    

    'Westminster'
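
To stay closer to the original groupby idea, func can be rewritten to take the single DataFrame that apply passes in for each group. The following is only a minimal sketch: the name median_price_gap and the tiny listings_demo frame are made up for illustration (the real data would come from the CSV in the question), and it assumes 'price' holds strings such as "$1,200.00".

```python
import pandas as pd


def median_price_gap(group: pd.DataFrame) -> float:
    """Median superhost price minus median non-superhost price for one group."""
    is_super = group["host_is_superhost"] == "t"
    is_not_super = group["host_is_superhost"] == "f"
    return (
        group.loc[is_super, "price_cleaned"].median()
        - group.loc[is_not_super, "price_cleaned"].median()
    )


# Hypothetical stand-in for the real listings DataFrame.
listings_demo = pd.DataFrame(
    {
        "neighbourhood_cleansed": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "host_is_superhost": ["t", "t", "f", "f", "t", "t", "f", "f"],
        "price": [
            "$100.00", "$300.00", "$50.00", "$150.00",
            "$1,200.00", "$800.00", "$900.00", "$700.00",
        ],
    }
)

# Clean the price strings so that median() works on numbers.
listings_demo["price_cleaned"] = (
    listings_demo["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# apply calls median_price_gap once per neighbourhood with a two-column DataFrame.
gaps = (
    listings_demo
    .groupby("neighbourhood_cleansed")[["host_is_superhost", "price_cleaned"]]
    .apply(median_price_gap)
)

# idxmax returns the index label (the neighbourhood) with the largest gap.
print(gaps.idxmax())
```

A neighbourhood with no superhosts (or no non-superhosts) gets a NaN gap, which idxmax skips by default. On the full dataset the pivot_table version is likely to be faster, since apply runs a Python function once per group.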