Say I have the Airbnb dataset with a bunch of columns. Of interest are 'neighbourhood_cleansed', 'host_is_superhost' and 'price'. I wish to find the neighbourhood in which the difference between the median prices of superhosts and non-superhosts is the maximum.
I want to know if this can be done entirely using pandas functions.
My logic is to group by 'neighbourhood_cleansed' at first, then filter the groupby object into superhosts and non-superhosts, and then use the median function.
I have defined a function func:
import pandas as pd

def func(host_is_superhost, price):
    # Split prices by superhost flag and compare the medians
    superhost_prices = price[host_is_superhost == 't']
    notsuperhost_prices = price[host_is_superhost == 'f']
    return superhost_prices.median() - notsuperhost_prices.median()

listings = pd.read_csv("https://storage.googleapis.com/public-data-337819/listings%202%20reduced.csv", low_memory=False)
neighbourhoods = listings.groupby('neighbourhood_cleansed')[['host_is_superhost', 'price']]
When I run the following:
neighbourhoods.apply(func)
The error thrown is
TypeError: func() missing 1 required positional argument: 'price'
How do I solve this?
Do y'all have better ways of solving the initial question?
Your original func expected two arguments but received only one, which is why you got the error message.
To understand what is going on here when using apply, try this example.
We can see that the 'source' column has only two values:
listings["source"].unique()
Out: array(['city scrape', 'previous scrape'], dtype=object)
Let's try a simpler version of func with a groupby on 'source':
def func2(row):
    # Despite the name, `row` is whatever apply passes in for each group
    print(type(row))
    print(row.shape)

grpby = listings.groupby("source")[["host_is_superhost", "price"]]
grpby.apply(func2)
Prints out:
<class 'pandas.core.frame.DataFrame'>
(55934, 2)
<class 'pandas.core.frame.DataFrame'>
(32012, 2)
This shows that when apply is used, func2 is called once per group and is passed a single pd.DataFrame holding that group's rows, so the number of rows varies from group to group.
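With that in mind, here is a minimal sketch of how your func could be rewritten to accept one DataFrame per group. The toy data and the name median_diff are made up for illustration, and the prices are assumed to be numeric already (on the real data you would need to clean 'price' first):

```python
import pandas as pd

# Toy stand-in for the listings data (values invented for illustration).
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "B", "B", "B"],
    "host_is_superhost": ["t", "f", "f", "t", "t", "f"],
    "price": [100.0, 80.0, 60.0, 50.0, 70.0, 90.0],
})

def median_diff(group):
    """Median superhost price minus median non-superhost price for one group."""
    is_super = group["host_is_superhost"] == "t"
    is_not_super = group["host_is_superhost"] == "f"
    return (
        group.loc[is_super, "price"].median()
        - group.loc[is_not_super, "price"].median()
    )

# apply calls median_diff once per neighbourhood with that group's DataFrame.
diffs = (
    toy.groupby("neighbourhood_cleansed")[["host_is_superhost", "price"]]
    .apply(median_diff)
)

# Label of the neighbourhood with the largest difference.
best = diffs.idxmax()
print(diffs)
print(best)
```

Since the function returns a scalar for each group, the result is a Series indexed by neighbourhood, and idxmax gives the label of the largest value.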
An alternative approach that should accomplish what you want is to use pd.pivot_table to reshape the data and calculate the median of price. (Note that 'price' is not numeric and needs to be cleaned to be useful.) For example:
# Strip the "$" sign and thousands separators, then convert to float
listings["price_cleaned"] = (
    listings["price"].apply(lambda row: row.strip("$").replace(",", "")).astype(float)
)

# One row per neighbourhood, one column per superhost flag ('f' / 't')
pt = pd.pivot_table(
    listings,
    values="price_cleaned",
    index="neighbourhood_cleansed",
    columns="host_is_superhost",
    aggfunc="median",
)
pt["diff"] = pt["t"] - pt["f"]
mask = pt["diff"] == pt["diff"].max()
print(pt.index[mask][0])  # there is only one neighbourhood in this case
'Westminster'
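If you would rather stay with groupby, the same table can probably be built by grouping on both columns and unstacking the superhost flag. This is only a sketch on invented toy data, and it assumes the prices are strings shaped like "$1,234.00":

```python
import pandas as pd

# Toy stand-in for the listings data (values invented for illustration).
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "B", "B", "B"],
    "host_is_superhost": ["t", "f", "f", "t", "t", "f"],
    "price": ["$1,100.00", "$80.00", "$60.00", "$50.00", "$70.00", "$90.00"],
})

# Vectorised cleaning: drop "$" and "," then convert to float.
toy["price_cleaned"] = (
    toy["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Median price per (neighbourhood, superhost flag), with the flag moved
# into columns so that 't' and 'f' sit side by side.
medians = (
    toy.groupby(["neighbourhood_cleansed", "host_is_superhost"])["price_cleaned"]
    .median()
    .unstack("host_is_superhost")
)

diff = medians["t"] - medians["f"]
best = diff.idxmax()  # neighbourhood with the largest difference
print(best)
```

Here idxmax returns the index label directly, so no boolean mask is needed. By default it skips neighbourhoods where the difference is NaN, which happens when one of the two host types is missing.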