
Dealing with Longitude and Latitude in Feature Engineering


I have a dataset that contains information about houses worldwide, with the following features: house size, number of bedrooms, city name, country name, garden or not, ... (and many other typical house attributes). The target variable is the price of the house.

I know that strings cannot be fed directly into a machine learning or neural network model. One-hot encoding the city name and the country name would leave me with a few hundred columns, so I decided instead to replace the city name with its geographical coordinates (one column for longitude and one column for latitude). The city where a house is located will obviously help determine the price of the house.

So does replacing the city name with its longitude and latitude preserve this important information? Is it all right to make this replacement?


Solution

  • Cartesian coordinates can be useful for the model to some extent. However, for certain models such as decision trees, which split on one feature at a time, properly modeling the dependency of the target variable on geographical coordinates can require overly complex models. For a clear and visual understanding of this you may check this.

    A common approach in these cases is to transform the coordinates into polar coordinates and add them as new features. When you think about it, you're adding a new way of expressing the same thing, just in a different scale or system. That way a tree will require fewer splits to model the spatial dependency of the samples.

    That being said, I would not completely replace the existing geolocation data with coordinates. It would probably also be worth adding some aggregates or statistics based on the country or city data, rather than only one-hot encoding them or replacing them with coordinates.
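
As a rough illustration of the two suggestions above, here is a minimal sketch using pandas and NumPy. The column names (`city`, `longitude`, `latitude`, `house_size`), the helper names, and the choice of aggregates are all assumptions made for this example, not something taken from the question's dataset. The polar transform simply treats longitude as x and latitude as y, so it ignores the curvature of the Earth and the wrap-around at ±180° longitude.

```python
import numpy as np
import pandas as pd


def add_polar_features(df, lon_col="longitude", lat_col="latitude",
                       center_lon=0.0, center_lat=0.0):
    """Return a copy of df with polar versions of the coordinates added.

    Column names and the reference center are hypothetical defaults.
    """
    out = df.copy()
    # Treat longitude as x and latitude as y, relative to the chosen center.
    x = out[lon_col] - center_lon
    y = out[lat_col] - center_lat
    out["geo_radius"] = np.hypot(x, y)   # distance from the center, in degrees
    out["geo_angle"] = np.arctan2(y, x)  # angle around the center, in radians
    return out


def add_city_aggregates(df, city_col="city", size_col="house_size"):
    """Return a copy of df with simple per-city statistics added.

    Only non-target columns are aggregated here, to avoid leaking the price.
    """
    out = df.copy()
    grouped = out.groupby(city_col)[size_col]
    out["city_listing_count"] = grouped.transform("count")
    out["city_median_size"] = grouped.transform("median")
    return out


# Tiny made-up sample, only to show the shape of the result.
houses = pd.DataFrame({
    "city": ["Paris", "Paris", "Lima", "Tokyo"],
    "longitude": [2.35, 2.35, -77.04, 139.69],
    "latitude": [48.86, 48.86, -12.05, 35.68],
    "house_size": [80.0, 120.0, 95.0, 60.0],
})

features = add_polar_features(houses)
features = add_city_aggregates(features)
print(features)
```

The original longitude and latitude columns are kept alongside the new ones, in line with the advice not to replace existing information. If you aggregate the target itself (for example, mean price per city), compute it on the training split only so the statistic does not leak the value you are trying to predict.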