python machine-learning data-science linear-regression

How do I encode location data for linear regression in Python?

I am doing a beginner project to predict house prices. I have one category called 'city' with values such as Boston, Detroit, NY, etc.

If I use One Hot Encoder I will end up with a very huge dataset because there are around 100 unique values. But I am unsure about using Label Encoding because the regression may treat the values as numerical.

I thought that would be plenty explained on the internet as it is a typical exercise but I am unable to find a solution. How do I encode location data?

Solution

If you have a categorical feature with many unique values, try using target encoding

The idea is to calculate average price for each city and then replace city name with that average for each row. This average price will tell your model how attractive this city in general.

One possible implementation is this:

mean_by_city = df.groupby("city").agg({"price": "mean"}).squeeze().to_dict()
df["city_mean_encoding"] = df["city"].map(mean_by_city)

One advice: don't forget to check for rare categories. If your dataset has only few examples of houses in some small town, the mean might not be accurate. In this case I can recommend to add a dummy "Rare" category.