
How to make a statistical model with data from different locations (categorical variables)?


I am helping my girlfriend build a model for her master's thesis project (Env. Sci). The dataset has these columns: Site, Distance (m), Depth (cm), pH, %N, %C, C:N.

She measured pH, total carbon, and total nitrogen in soil/peat samples from 5 different mires (wetlands).

'Distance (m)' is the distance from a non-random starting point (the wet zone); at some of the sites it also runs in the opposite direction, giving negative values. C:N is derived from %N and %C, and Depth is the depth at which the soil sample was taken.

How should we model the data? We suspect there is a relation between all of the variables.

Should the data be grouped by site, with a regression model fitted per site and then compared across sites? Or how do you deal with 'sites' (categorical variables) alongside numerical values?


Solution

  • You can use many techniques to deal with this problem. One-hot encoding is one of them; which one fits best depends on your data. I highly recommend reading this page to decide on the best option: https://www.datacamp.com/community/tutorials/categorical-data Also, you shouldn't select your features by hand. ("We suspect there is a relation between all of the variables" -> you don't have to decide yourself which features are the most relevant.) There are feature selection methods you can use instead. Check these out: https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/

    https://towardsdatascience.com/the-5-feature-selection-algorithms-every-data-scientist-need-to-know-3a6b566efd2
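
To make the one-hot encoding suggestion concrete, here is a minimal sketch in plain Python. The rows, the site labels A/B/C, and the helper name `one_hot_sites` are all invented for illustration and are not the thesis data. It turns the Site column into 0/1 indicator columns and drops the first level so that it serves as the baseline, which is the usual choice when the regression also has an intercept.

```python
# Toy rows, invented purely for illustration (NOT the real thesis data).
rows = [
    {"site": "A", "distance_m": -5.0, "depth_cm": 10.0, "pH": 4.1},
    {"site": "A", "distance_m": 10.0, "depth_cm": 20.0, "pH": 4.4},
    {"site": "B", "distance_m": 0.0, "depth_cm": 10.0, "pH": 5.0},
    {"site": "C", "distance_m": 15.0, "depth_cm": 30.0, "pH": 4.7},
]


def one_hot_sites(rows, drop_first=True):
    """Return (column_names, encoded_rows) with one 0/1 column per kept site.

    Hypothetical helper: libraries such as pandas or scikit-learn offer
    equivalent functionality.
    """
    levels = sorted({r["site"] for r in rows})
    # Dropping the first level makes it the baseline and avoids perfect
    # collinearity between the indicator columns and an intercept term.
    kept = levels[1:] if drop_first else levels
    names = [f"site_{s}" for s in kept]
    encoded = [[1 if r["site"] == s else 0 for s in kept] for r in rows]
    return names, encoded


names, encoded = one_hot_sites(rows)

# Predictor rows: the numeric columns followed by the site indicators.
design = [
    [r["distance_m"], r["depth_cm"]] + dummies
    for r, dummies in zip(rows, encoded)
]
target = [r["pH"] for r in rows]

for name_row in (["distance_m", "depth_cm"] + names, *design):
    print(name_row)
```

Each row of `design` can then be fed, together with `target`, to whatever regression routine you prefer. The coefficient on each site indicator is then read as that site's offset from the baseline site, with distance and depth held fixed.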