Tags: r, dataframe, regression, logistic-regression, non-linear-regression

Turn numeric database into presence/absence dataframe for logistic regression


I have a database containing the number of cells in different growth stages (enlarging, thickening and maturing) for different trees over many years. I collected data at intervals of several days, recorded as day of the year (DOY; January 1 is DOY 1, January 2 is DOY 2, etc.). I simplified it like this to make a reproducible example:

df <- data.frame("Year" = c(2012, 2012, 2012, 2012, 2012, 2012, 2012,
                            2012, 2012, 2012, 2013, 2013, 2013,
                            2013, 2013),
                 "Tree" = c(15, 15, 15, 15, 15, 22, 22, 22, 22, 22, 41, 41,
                            41, 41, 41),
                 "DOY" = c(65, 97, 125, 177, 214, 65, 97, 125, 177, 214,
                           61, 99, 118, 166, 221),
                 "Enlarging" = c(0, 2, 4, 5, 0, 0, 3, 6, 3, 0, 0, 5, 4, 4, 0),
                 "Thickening" = c(0, 0, 2, 4, 0, 0, 0, 4, 3, 0, 0, 0, 3, 2, 0),
                 "Maturing" = c(0, 0, 3, 7, 0, 0, 0, 3, 4, 0, 0, 3, 6, 8, 0))

library(dplyr)  # provides %>% and mutate()

df <- df %>%
  mutate(Year = as.factor(Year),
         Tree = as.factor(Tree),
         DOY = as.numeric(DOY),
         Enlarging = as.numeric(Enlarging),
         Thickening = as.numeric(Thickening),
         Maturing = as.numeric(Maturing))

print(df)
   Year Tree DOY Enlarging Thickening Maturing
1  2012   15  65         0          0        0
2  2012   15  97         2          0        0
3  2012   15 125         4          2        3
4  2012   15 177         5          4        7
5  2012   15 214         0          0        0
6  2012   22  65         0          0        0
7  2012   22  97         3          0        0
8  2012   22 125         6          4        3
9  2012   22 177         3          3        4
10 2012   22 214         0          0        0
11 2013   41  61         0          0        0
12 2013   41  99         5          0        3
13 2013   41 118         4          3        6
14 2013   41 166         4          2        8
15 2013   41 221         0          0        0

I have two questions. The simple one: how can I turn this type of database into a presence (1)/absence (0) dataframe? If the number of cells is 0, keep it 0. If the number of cells is >= 1, turn it into 1. Simple as that.

My second (bonus) question: I want to fit a logistic regression to this 0/1 dataframe, but as you can see, my samplings took place every 30 days or more. I would like to predict at daily resolution, for example by creating a sequence seq(1, 365, 1) covering the 365 days of the year and predicting a value for each day from the fitted model. That way I could determine the exact day on which each growth stage started and ended.

The second question could save me A LOT of time. I have tried different scripts and I always end up with a different error. That's all I need. Thank you so much; I hope someone can help me.


Solution

  • To answer the first part, there are several ways to recode multiple columns as 0/1. You could try:

    df[,4:6] <- (df[,4:6] > 0)*1 
    

    or

    df[4:6] <- lapply(df[4:6], function(x) +(x > 0))
    

    You can use these values as outcomes in a logistic regression, but I'm unsure what you are looking for from your description (i.e., prediction vs. parameter estimation? Generalized estimating equations? Something else, e.g., time-to-event analysis?). If you provide examples of what you want, I can edit the answer to provide more help. Good luck!
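
If daily prediction is what you are after, one possible starting point is sketched below. It is a rough illustration under several assumptions of mine, not a validated analysis: it looks only at the Enlarging column, pools all trees and years into one model, adds a squared DOY term so the predicted probability can rise and then fall within a year, and uses 0.5 as the cut-off for "present". The names Enlarging01, daily, active, onset and end are made up for the example. With only 15 rows the 0s and 1s are perfectly separated, so glm() is expected to warn about fitted probabilities of 0 or 1; the coefficients from this toy fit should not be interpreted.

```r
# Toy data from the question (Enlarging only).
df <- data.frame(
  DOY = c(65, 97, 125, 177, 214, 65, 97, 125, 177, 214,
          61, 99, 118, 166, 221),
  Enlarging = c(0, 2, 4, 5, 0, 0, 3, 6, 3, 0, 0, 5, 4, 4, 0)
)

# Presence (1) / absence (0), stored in a new column so the counts are kept.
df$Enlarging01 <- as.integer(df$Enlarging > 0)

# Logistic regression; the quadratic DOY term is an assumption that lets
# the probability of presence go up and come back down during the year.
# Expect separation warnings on this tiny example.
fit <- glm(Enlarging01 ~ DOY + I(DOY^2), family = binomial, data = df)

# Predicted probability of presence for each day of the year.
daily <- data.frame(DOY = seq(1, 365, 1))
daily$prob <- predict(fit, newdata = daily, type = "response")

# Days on which presence is predicted to be more likely than absence
# (0.5 is an arbitrary threshold chosen for illustration).
active <- daily$DOY[daily$prob >= 0.5]
onset <- min(active)
end <- max(active)
```

The same steps could be repeated for Thickening and Maturing, or per tree and year, but each group would need enough samplings around the start and end of growth for the fit to be meaningful.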