Search code examples
rgroupingdifference

How to find difference between values based on condition by group in r and create new column with that value?


I need to calculate the thickness of the hypoxic layer (hypoxia == ODO_mgL < 2.0) by CRN. I need the depth (in meters) of how thick hypoxia is for DO < 2.0 mgL (create new column for this thickness of the hypoxic layer) by CRN. For example, if I have a CRN (site) with DO > 2 from depth 0m to depth 9m and DO drops below 2 from depth 9m to depth 10m the thickness would be 1m. If there is no hypoxia for a particular site then the hypoxia thickness should be 0m. I need to create a new column that calculates the difference between the depth where hypoxia (DO<2) starts and the maximum depth for each unique CRN.

I have spent over 4 hours trying to do this (still learning r) and feel like I am getting close, but have not figured it out. I have looked in SO and other resources, but maybe I am not wording this correctly.

Example df:

structure(list(DATE = c("8/16/2021", "8/16/2021", "8/16/2021", 
"8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", 
"8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", 
"8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", 
"8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", 
"8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", 
"8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", 
"8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", 
"8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", 
"8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", 
"8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", 
"8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", 
"8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", 
"8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", 
"8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", 
"8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", "8/16/2021", 
"8/16/2021", "8/16/2021"), TIME = c("9:00:37 AM", "9:00:38 AM", 
"9:00:39 AM", "9:00:40 AM", "9:00:41 AM", "9:00:42 AM", "9:00:43 AM", 
"9:00:44 AM", "9:00:45 AM", "9:00:46 AM", "9:00:47 AM", "9:00:48 AM", 
"9:00:49 AM", "9:00:50 AM", "9:00:51 AM", "9:00:52 AM", "9:00:53 AM", 
"9:00:54 AM", "9:00:55 AM", "9:00:56 AM", "9:00:57 AM", "9:00:58 AM", 
"9:00:59 AM", "9:01:00 AM", "9:01:01 AM", "9:01:02 AM", "9:01:03 AM", 
"9:01:04 AM", "9:01:05 AM", "9:01:06 AM", "9:01:07 AM", "9:01:08 AM", 
"9:01:09 AM", "9:01:10 AM", "9:01:11 AM", "9:01:12 AM", "9:01:13 AM", 
"9:01:14 AM", "9:01:15 AM", "9:01:16 AM", "9:01:17 AM", "9:01:18 AM", 
"9:01:19 AM", "9:01:20 AM", "9:01:21 AM", "9:01:22 AM", "9:01:23 AM", 
"9:01:24 AM", "9:01:25 AM", "9:01:26 AM", "9:01:27 AM", "9:01:28 AM", 
"9:01:29 AM", "9:01:30 AM", "9:01:31 AM", "9:01:32 AM", "9:01:33 AM", 
"9:01:34 AM", "9:01:35 AM", "9:01:36 AM", "9:01:37 AM", "9:01:38 AM", 
"9:01:39 AM", "9:01:40 AM", "9:01:41 AM", "9:01:42 AM", "9:01:43 AM", 
"9:01:44 AM", "9:01:45 AM", "9:01:46 AM", "9:01:47 AM", "9:01:48 AM", 
"9:01:49 AM", "9:01:50 AM", "9:01:51 AM", "9:01:52 AM", "9:01:53 AM", 
"9:01:54 AM", "9:01:55 AM", "9:01:56 AM"), CRN = c(801, 801, 
801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 
801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 
801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 
801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 
801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 
801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801, 801
), ODO_mgL = c(8.84, 8.8, 8.76, 8.72, 8.69, 8.65, 8.63, 8.59, 
8.57, 8.54, 8.49, 8.44, 8.39, 8.35, 8.31, 8.28, 8.25, 8.23, 8.21, 
8.19, 8.17, 8.15, 8.14, 8.12, 8.11, 8.1, 8.09, 7.65, 7.35, 7.1, 
7.11, 7.08, 7.01, 6.56, 6.41, 6.28, 6.22, 6.08, 5.66, 5.53, 5.38, 
5.16, 5.19, 5.1, 5.02, 4.59, 4.43, 4.39, 4.46, 4.44, 4.37, 4.31, 
4.25, 3.8, 3.71, 3.6, 3.57, 3.51, 3.47, 3.43, 3.39, 3.34, 2.95, 
2.81, 2.66, 2.59, 2.51, 2.44, 2.38, 2.32, 2.27, 1.91, 1.85, 1.78, 
1.76, 1.72, 1.7, 1.67, 1.66, 1.63), Depth_m = c(0.039, 0.043, 
0.052, 0.757, 0.678, 0.764, 0.764, 0.833, 0.837, 0.838, 0.857, 
0.893, 2.01, 2.155, 2.368, 2.368, 2.368, 2.406, 4.205, 4.299, 
4.265, 4.265, 4.252, 4.252, 4.253, 4.259, 4.27, 5.291, 5.498, 
5.387, 5.479, 5.479, 5.486, 5.513, 5.562, 5.628, 5.668, 5.722, 
5.772, 5.82, 5.855, 5.917, 5.958, 6, 6.036, 6.06, 6.102, 7.063, 
7.059, 7.035, 6.982, 6.984, 6.997, 7.032, 7.729, 7.581, 7.629, 
7.629, 7.649, 7.68, 7.756, 7.844, 7.972, 8.041, 8.1, 8.159, 8.225, 
8.316, 9.063, 9.218, 9.183, 9.137, 9.159, 9.188, 9.315, 9.52, 
9.625, 9.698, 9.754, 9.816), Vertical_Position_m = c(0.073, 0.223, 
0.662, 0.766, 0.684, 0.725, 0.892, 0.926, 0.781, 0.784, 1.013, 
1.467, 1.848, 2.035, 2.273, 2.359, 2.42, 3.132, 3.827, 4.226, 
4.254, 4.18, 4.227, 4.283, 4.272, 4.29, 4.65, 5.081, 5.352, 5.396, 
5.452, 5.488, 5.568, 5.742, 5.737, 5.761, 5.956, 6.049, 6.161, 
6.163, 6.142, 6.421, 6.426, 6.47, 6.468, 6.37, 6.81, 6.995, 7.052, 
7.033, 6.981, 7.002, 7.12, 7.518, 7.685, 7.637, 7.602, 7.674, 
7.702, 7.926, 8.146, 8.14, 8.34, 8.341, 8.475, 8.724, 8.741, 
8.952, 9.005, 9.166, 9.168, 9.135, 9.353, 9.556, 9.736, 9.874, 
9.917, 9.902, 10.15, 10.221), Temp_C = c(23, 23.0555555555556, 
23.1111111111111, 23.1666666666667, 23.2222222222222, 23.2777777777778, 
23.2777777777778, 23.3888888888889, 23.4444444444444, 23.4444444444444, 
23.5, 23.5555555555556, 23.6666666666667, 23.6666666666667, 23.7777777777778, 
23.7777777777778, 23.8333333333333, 23.8333333333333, 23.8333333333333, 
23.8333333333333, 23.8333333333333, 23.8333333333333, 23.8333333333333, 
23.8333333333333, 23.8333333333333, 23.8333333333333, 23.8333333333333, 
23.8333333333333, 23.8333333333333, 23.8333333333333, 22.5, 22.3888888888889, 
22.3888888888889, 22.2777777777778, 22.2222222222222, 22.0555555555556, 
21.9444444444444, 21.8333333333333, 21.5555555555556, 21.3888888888889, 
21.3333333333333, 21.1111111111111, 19.9444444444444, 19.8888888888889, 
19.7777777777778, 19.7777777777778, 19.7777777777778, 17.8333333333333, 
17.4444444444444, 17.3333333333333, 17.3333333333333, 17.3333333333333, 
17.3333333333333, 17.3333333333333, 17.3333333333333, 17.2777777777778, 
17.2222222222222, 17.1666666666667, 17.1111111111111, 17.0555555555556, 
17, 17, 16.8888888888889, 16.8888888888889, 16.8333333333333, 
16.7777777777778, 16.7222222222222, 16.6666666666667, 16.6666666666667, 
16.6111111111111, 16.5555555555556, 16.5, 16.4444444444444, 16.3333333333333, 
16.2777777777778, 16.2222222222222, 16.1666666666667, 16.1111111111111, 
16, 15.9444444444444)), row.names = c(NA, 80L), class = "data.frame")

I have tried the following:

dfso %>% 
  group_by(CRN) %>%
  summarise(thick = Depth_m[])

this just gives me the same values in the Depth_m column, so not helpful

dfso %>% 
  group_by(CRN, DO2 = ifelse(ODO_mgL < 2, "below", "above")) %>% 
  mutate(HypoxThick = Depth_m-lag(Depth_m))

this does that lag/stepwise (not sure how you call it) difference, which is not what I need

dfso %>% 
  group_by(CRN, DO2thick = ifelse(ODO_mgL < 2, "below", "above")) %>% 
  mutate(HypThick = max(Depth_m))

this just gives me the maximum depth by CRN

dfso %>%
  group_by(CRN) %>%
  mutate(thick = case_when(ODO_mgL<2 ~ max(Depth_m)-Depth_m))

this is not helpful

dfso %>% 
  group_by(CRN, DO2 = ifelse(ODO_mgL < 2, "below", "above")) %>% 
  mutate(thick = max(Depth_m) - min(Depth_m))

This last one is extremely close to what I need. In this case the hypoxia thickness is 0.679m , but how do I get this value to also be applied to that ODO_mgL cutoff? That whole "thick" column should have only the 0.679 value.

I looked at the following posts: this, this, this, this, this, and many other SO posts, and blog posts elsewhere. Thank you for your time and help!


Solution

  • This solution assumes that your data is sorted by depth (i.e. that the depth where hypoxia starts is the maximum depth for that hypoxic layer). It goes a bit further than your last attempt in that it has thickness of 0 if the row is not in a hypoxic layer, and it should also work if you have multiple hypoxic layers in the data.

    My approach was:

    • create a new variable hypoxic_layer which begins at 0 and increments at the start of every hypoxic layer.
    • then set this colum = 0 whenever the ODO_mgL is above 2, so 0 means no hypoxic layer, and all hypoxic layers get their own number
    • next group by CRN and hypoxic_layer and calculate the thickness as the difference between min() and max() of Depth_m within each group.
    library(dplyr)
    
    want <- dfso %>% 
        mutate(hypoxic_layer=cumsum(if_else(CRN==lag(CRN) & ODO_mgL<2 & lag(ODO_mgL)>=2,1,0)),       # this column increments at the beginning of each new hypoxic layer
               hypoxic_layer=if_else(ODO_mgL>=2,0,hypoxic_layer)) %>%                # set hypoxic layer = 0 if ODOmgL>=2
        group_by(CRN,hypoxic_layer) %>%                                              # group by each hypoxic layer in each CRN 
        mutate(thickness=if_else(hypoxic_layer==0,0,max(Depth_m)-min(Depth_m)))      # difference between max and min depth in each group is the thickness
    
    

    Edit

    As per clarification that the thickness should be applied to all rows in each CRN, I have updated the code. I use summarise instead of mutate to calculate the thickness with just 1 row per hypoxic layer, then use right_join to merge the thickness column with the original data.

    want <- dfso %>% 
        mutate(hypoxic_layer=cumsum(if_else(CRN==lag(CRN) & ODO_mgL<2 & lag(ODO_mgL)>=2,1,0)),       # this column increments at the beginning of each new hypoxic layer
               hypoxic_layer=if_else(ODO_mgL>=2,0,hypoxic_layer)) %>%                # set hypoxic layer = 0 if ODOmgL>=2
        group_by(CRN,hypoxic_layer) %>%                                              # group by each hypoxic layer in each CRN 
        summarise(thickness=max(Depth_m)-min(Depth_m))  %>%                          # difference between max and min depth in each group is the thickness
        filter(hypoxic_layer != 0) %>%                                               # remove rows which are not hypoxic layers
        group_by(CRN) %>%                                                            #
        summarise(thickness=max(thickness)) %>%                                      # in case of multiple hypoxic layers in any CRN we take the maximum thickness
        right_join(dfso,by='CRN')                                                    # join this to our original data
    

    I included the group_by and summarise lines before the final right_join to catch any possible instance where there might be more than 1 hypoxic layer in an individual CRN. If this ever occurs, then the maximum thickness will be used.