I am helping my girlfriend build a model for her master's thesis project (Env. Sci). The dataset has these columns: Site, Distance (m), Depth (cm), pH, %N, %C, C:N
She measured pH, total carbon, and total nitrogen in soil/peat samples from 5 different mires (wetlands).
'Distance (m)' is the distance from a non-random starting point (the wet zone); at some of the sites it also runs backwards into negative values. C:N is derived from %N and %C, and Depth is the depth at which the soil sample was taken.
How should we model the data? We suspect there is a relationship between all of the variables.
Should the data be grouped by site, with a regression model fitted for each site and then compared against the other sites? Or how do you deal with 'sites' (a categorical variable) alongside numerical values?
You can use lots of techniques to deal with that problem. One-hot encoding is one of them: each site becomes its own 0/1 indicator column, so the categorical variable can sit next to the numerical ones in a single model. Which encoding is best depends on your data. I highly recommend reading this page to decide on the best option: https://www.datacamp.com/community/tutorials/categorical-data Also, you shouldn't select your features by hand. ("We suspect there is a relation between all of the variables" -> you don't have to decide yourself which features are the most relevant ones.) There are methods you can use for that. Check this out: https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/
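To make the one-hot encoding idea a bit more concrete, here is a minimal sketch in Python using pandas and NumPy. Everything in it is an assumption for illustration: the values are made-up toy numbers (not the thesis data), the column names `Distance_m` and `Depth_cm` are simply renamed versions of the columns in the question, and choosing pH as the response is arbitrary. It encodes `Site` as indicator columns and fits one ordinary least-squares regression across all sites instead of one model per site.

```python
import numpy as np
import pandas as pd

# Made-up toy values for illustration only; NOT the real measurements.
df = pd.DataFrame({
    "Site": ["A", "A", "B", "B", "C", "C"],
    "Distance_m": [-5.0, 10.0, 0.0, 15.0, -10.0, 20.0],
    "Depth_cm": [10.0, 30.0, 10.0, 30.0, 10.0, 30.0],
    "pH": [4.1, 4.6, 5.0, 5.3, 3.9, 4.4],
})

# One-hot encode Site. drop_first=True drops the first level ("A"), which
# becomes the reference site, so the indicator columns are not collinear
# with the intercept.
X = pd.get_dummies(
    df[["Site", "Distance_m", "Depth_cm"]],
    columns=["Site"],
    drop_first=True,
    dtype=float,
)
X.insert(0, "Intercept", 1.0)

# Ordinary least squares: pH ~ Intercept + Distance_m + Depth_cm + Site_B + Site_C
coef, *_ = np.linalg.lstsq(X.to_numpy(), df["pH"].to_numpy(), rcond=None)
coefficients = dict(zip(X.columns, coef))
print(coefficients)
```

In this sketch the `Site_B` and `Site_C` coefficients are read as shifts in pH relative to site A at the same distance and depth. With the real data there would be four indicator columns for the 5 mires. Since C:N is derived from %N and %C, putting all three in as predictors at once would probably make the coefficients hard to interpret.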