I have selected two columns in a groupby object. How do I apply a true or false filter on one and then apply a function on the other?

Say I have the Airbnb dataset with a bunch of columns. Of interest are 'neighbourhood_cleansed', 'host_is_superhost' and 'price'. I wish to find the neighbourhood in which the difference between the median prices of superhosts and non-superhosts is the maximum.

I want to know if this can be done entirely using pandas functions.

My logic is to group by 'neighbourhood_cleansed' at first, then filter the groupby object into superhosts and non-superhosts, and then use the median function.

I have defined a function func:

def func(host_is_superhost, price):
    superhost_prices = price[host_is_superhost == 't']
    notsuperhost_prices = price[host_is_superhost == 'f']
    return (superhost_prices.median() - notsuperhost_prices.median())

import pandas as pd

listings = pd.read_csv("https://storage.googleapis.com/public-data-337819/listings%202%20reduced.csv", low_memory=False)
neighbourhoods = listings.groupby('neighbourhood_cleansed')[['host_is_superhost', 'price']]

When I run the following:

neighbourhoods.apply(func)

The error thrown is

TypeError: func() missing 1 required positional argument: 'price'

How do I solve this?

Do y'all have better ways of solving the initial question?

Solution

Your original func expects two arguments, but apply passes each group to it as a single DataFrame, so the second argument (price) is never supplied. That is why you get the error message.

To understand what is going on here when using apply, try this example.

We can see that the 'source' column only has two values.

listings["source"].unique()

Out: array(['city scrape', 'previous scrape'], dtype=object)

Let's try a simpler version of func with a groupby on 'source':

def func2(group):
    # apply passes each group in as one object; inspect what it is
    print(type(group))
    print(group.shape)
    
    
grpby = listings.groupby("source")[["host_is_superhost", "price"]]
grpby.apply(func2)

Prints out:

<class 'pandas.core.frame.DataFrame'>
(55934, 2)
<class 'pandas.core.frame.DataFrame'>
(32012, 2)

This shows that when apply is used, func2 is passed a single pd.DataFrame per group, each with the two selected columns and a varying number of rows.
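With that in mind, one way to fix the original func is to have it accept that single DataFrame and pick the columns out inside the function. The following is only a minimal sketch: the toy DataFrame and the name median_gap are made up for illustration, and it assumes 'price' is already numeric (in the real dataset it is not, so it would need cleaning first).

```python
import pandas as pd

# Toy data standing in for the real listings (values are invented for illustration).
toy = pd.DataFrame(
    {
        "neighbourhood_cleansed": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "host_is_superhost": ["t", "t", "f", "f", "t", "t", "f", "f"],
        "price": [100.0, 120.0, 80.0, 90.0, 50.0, 70.0, 55.0, 65.0],
    }
)


def median_gap(group):
    """Median superhost price minus median non-superhost price for one group."""
    is_super = group["host_is_superhost"] == "t"
    is_not_super = group["host_is_superhost"] == "f"
    return (
        group.loc[is_super, "price"].median()
        - group.loc[is_not_super, "price"].median()
    )


# apply hands each neighbourhood's rows to median_gap as one DataFrame.
gaps = toy.groupby("neighbourhood_cleansed")[["host_is_superhost", "price"]].apply(
    median_gap
)
print(gaps)
print(gaps.idxmax())  # neighbourhood with the largest difference
```

On the real data the same pattern should apply once a numeric price column exists; idxmax then returns the neighbourhood label directly.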

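An alternative approach that should accomplish what you want is to use pd.pivot_table to reshape the data and calculate the median price for each combination of neighbourhood and superhost status. (Note that 'price' is stored as text with a "$" prefix and comma separators rather than as a number, so it needs to be cleaned to be useful.) For example: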

listings["price_cleaned"] = (
    listings["price"].apply(lambda row: row.strip("$").replace(",", "")).astype(float)
)
pt = pd.pivot_table(
    listings,
    values="price_cleaned",
    index="neighbourhood_cleansed",
    columns="host_is_superhost",
    aggfunc="median",
)
pt["diff"] = pt["t"] - pt["f"]
mask = pt["diff"] == pt["diff"].max()
print(pt.index[mask][0])  # there is only one neighborhood in this case

'Westminster'
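If you would rather stay with groupby, a roughly equivalent sketch is to group by both columns, take the median, and unstack the superhost flag into columns. The toy data below is invented for illustration, and the regex-based cleaning is just one possible way to strip the "$" and "," characters; I have not run this against the full dataset.

```python
import pandas as pd

# Toy data with string prices, mimicking the format of the real 'price' column.
toy = pd.DataFrame(
    {
        "neighbourhood_cleansed": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "host_is_superhost": ["t", "t", "f", "f", "t", "t", "f", "f"],
        "price": [
            "$100.00", "$1,200.00", "$80.00", "$90.00",
            "$50.00", "$70.00", "$55.00", "$65.00",
        ],
    }
)

# Remove "$" and "," with vectorised string methods, then convert to float.
toy["price_cleaned"] = (
    toy["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# One median per (neighbourhood, superhost flag), with the flag spread into columns.
medians = (
    toy.groupby(["neighbourhood_cleansed", "host_is_superhost"])["price_cleaned"]
    .median()
    .unstack("host_is_superhost")
)

diff = medians["t"] - medians["f"]
best = diff.idxmax()  # index label of the largest difference
print(best)
```

Using idxmax avoids building a boolean mask, though it only returns the first neighbourhood if several are tied for the maximum.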