Search code examples
rdataframeif-statementrecode

Recoding a complex composite score in R


Assume my research concerns an observational longitudinal cohort study.

Let γ_comp be the composite outcome of interest and γ1....γ4 at time t1 and t2 denote components of γ_comp. In addition, the dataset has three other variables (χ1, χ2, and χ3) which will be used in future analysis but are not necessary to code γ_comp. Here is an excerpt of the data.frame

df <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 
                    Y1_t1 = c(5, 6, 10, 7, 5, 7, 5, 4, 7, 4), 
                    Y2_t1 = c(6, 4, 8, 8, 7, 10, 7, 6, 5, 7), 
                    Y3_t1 = c(5, 6, 10, 4, 8, 5, 10, 5, 4, 6), 
                    Y4_t1 = c(4.5, 8.5, 9.5, 4.5, 5, 8, 4.5, 8.5, 4, 6), 
                    Y1_t2 = c(6, 4, 5, 5, 3, 4, 8, 4, 3, 2), 
                    Y2_t2 = c(5, 4, 3, 6, 5, 5, 5, 2, 2, 8), 
                    Y3_t2 = c(2, 2, 4, 5, 4, 9, 5, 3, 2, 4), 
                    Y4_t2 = c(3.5, 6, 5, 5, 4.5, 4, 2.5, 7, 4.5, 4), 
                    X1 = c(40, 45, 52, 44, 42, 65, 55, 61, 52, 49), 
                    X2 = c("NL", "UK", "NL", "US", "UK", "US", "NL", "NL", "UK", "UK"), 
                    X3 = c(2000, 2005, 2003, 2000, 2001, 2002, 2003, 2004, 2001, 2000)), 
                    class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L))

Structure

spec_tbl_df [10 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID   : num [1:10] 1 2 3 4 5 6 7 8 9 10
 $ Y1_t1: num [1:10] 5 6 10 7 5 7 5 4 7 4
 $ Y2_t1: num [1:10] 6 4 8 8 7 10 7 6 5 7
 $ Y3_t1: num [1:10] 5 6 10 4 8 5 10 5 4 6
 $ Y4_t1: num [1:10] 4.5 8.5 9.5 4.5 5 8 4.5 8.5 4 6
 $ Y1_t2: num [1:10] 6 4 5 5 3 4 8 4 3 2
 $ Y2_t2: num [1:10] 5 4 3 6 5 5 5 2 2 8
 $ Y3_t2: num [1:10] 2 2 4 5 4 9 5 3 2 4
 $ Y4_t2: num [1:10] 3.5 6 5 5 4.5 4 2.5 7 4.5 4
 $ X1   : num [1:10] 40 45 52 44 42 65 55 61 52 49
 $ X2   : chr [1:10] "NL" "UK" "NL" "US" ...
 $ X3   : num [1:10] 2000 2005 2003 2000 2001 ...

As mentioned earlier, I am interested in calculating γ_comp. The rules for recording are as follows:

  • 3 out of 4 components (i.e., γ1....γ4 must have more than 20% improvement (i.e. decrease) on numeric scale (0 - 10) [higher is worse] at t2 compared to t1).
  • In the "remaining component," there should be no worsening of more than 20% at t2 compared to t1

I believe the following steps have to be taken to achieve this aim. First, Y1_diff = Y1_t2/Y1_t1 must be calculated for every component. This is the proportion between two time points and should be <0.80. Next, an if_else condition has to be applied, which reinforces these rules and returns 1 if the rules are met and 0 if not (i.e., "responded" to treatment or not).

For example, this could be a desired output:

      ID Ycomp Y1_t1 Y2_t1 Y3_t1 Y4_t1 Y1_t2 Y2_t2 Y3_t2 Y4_t2 Y1_diff Y2_diff Y3_diff Y4_diff    X1 X2       X3
 1     1     0     5     6     5   4.5     6     5     2   3.5    1.2     0.83    0.4     0.78    40 NL     2000
 2     2     1     6     4     6   8.5     4     4     2   6      0.67    1       0.33    0.71    45 UK     2005
 3     3     1    10     8    10   9.5     5     3     4   5      0.5     0.38    0.4     0.53    52 NL     2003
 4     4     0     7     8     4   4.5     5     6     5   5      0.71    0.75    1.25    1.11    44 US     2000
 5     5     1     5     7     8   5       3     5     4   4.5    0.6     0.71    0.5     0.9     42 UK     2001
 6     6     0     7    10     5   8       4     5     9   4      0.57    0.5     1.8     0.5     65 US     2002
 7     7     0     5     7    10   4.5     8     5     5   2.5    1.6     0.71    0.5     0.56    55 NL     2003
 8     8     0     4     6     5   8.5     4     2     3   7      1       0.33    0.6     0.82    61 NL     2004
 9     9     1     7     5     4   4       3     2     2   4.5    0.43    0.4     0.5     1.13    52 UK     2001
10    10     1     4     7     6   6       2     8     4   4      0.5     1.14    0.67    0.67    49 UK     2000

I would appreciate any advice on recoding the composite score γ_comp. Alternative methods are also welcome. The idea is to use γ_comp in logistic regression in future analysis.


Solution

  • This should do it for you, assuming my understanding is correct:

    inner_join(
      df, 
      df %>%
        select(ID,starts_with("Y")) %>% 
        pivot_longer(!ID,names_to = c("Y","t"), names_sep="_") %>% 
        pivot_wider(id_cols = ID:Y, names_from=t, values_from = value) %>% 
        mutate(change=1-t2/t1) %>% 
        group_by(ID) %>% 
        mutate(impct = sum(change>0.2)) %>% 
        summarize(Y_comp=1*all(impct==4 | (impct==3 & min(change)>=-0.2))) 
    ) %>% relocate(Y_comp,.after = ID)
    

    Output:

          ID Y_comp Y1_t1 Y2_t1 Y3_t1 Y4_t1 Y1_t2 Y2_t2 Y3_t2 Y4_t2    X1 X2       X3
       <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
     1     1      0     5     6     5   4.5     6     5     2   3.5    40 NL     2000
     2     2      1     6     4     6   8.5     4     4     2   6      45 UK     2005
     3     3      1    10     8    10   9.5     5     3     4   5      52 NL     2003
     4     4      0     7     8     4   4.5     5     6     5   5      44 US     2000
     5     5      1     5     7     8   5       3     5     4   4.5    42 UK     2001
     6     6      0     7    10     5   8       4     5     9   4      65 US     2002
     7     7      0     5     7    10   4.5     8     5     5   2.5    55 NL     2003
     8     8      0     4     6     5   8.5     4     2     3   7      61 NL     2004
     9     9      1     7     5     4   4       3     2     2   4.5    52 UK     2001
    10    10      1     4     7     6   6       2     8     4   4      49 UK     2000
    

    Explanation:

    This is an inner join between df, and a new dataframe that contains two columns ID and Y_comp. How is this second frame created?

    1. I select the columns ID and those starting with "Y"
    2. I pivot long, and the pivot wide to get the data into a format with four columns (ID, Y, t1, and t2).
    3. On each row, I estimate the change as 1-t2/t1.
    4. For each ID (group_by(ID)), I generate a column impt as the number of times change exceeds 0.2. This is constant over ID
    5. For each ID, I define Y_comp as TRUE if all of the rows have impct==4 (i.e. all are improvements) OR, if three are improvements and the minimum in the set is not less than negative 0.2).
    6. I multiply by 1 in that same line, to convert Y_comp to numeric 1/0, rather than T/F
    7. After the join is completed, I move Y_comp after ID, using relocate()

    Update:

    The OP is having an error, likely caused by namespace collision; one solution is to be specific about the packages being used:

    library(magrittr)
    dplyr::inner_join(
      df, 
      df %>%
        dplyr::select(ID,starts_with("Y")) %>% 
        tidyr::pivot_longer(!ID,names_to = c("Y","t"), names_sep="_") %>% 
        tidyr::pivot_wider(id_cols = ID:Y, names_from=t, values_from = value) %>% 
        dplyr::mutate(change=1-t2/t1) %>% 
        dplyr::group_by(ID) %>% 
        dplyr::mutate(impct = sum(change>0.2)) %>% 
        dplyr::summarize(Y_comp=1*all(impct==4 | (impct==3 & min(change)>=-0.2))) 
    ) %>% dplyr::relocate(Y_comp,.after = ID)