Tags: r, dplyr, na, calculated-columns, summarize

Calculate/Summarize new variable of 2 rows depending on second column gives NA


I am hoping for help from the swarm intelligence! While preparing my dataframe of laboratory values across different time points, I ran into the following issue. My dataframe is in long format and consists of 6 variables: the subject ID (sid), the timepoint (zeitpunkt), the intervention group of the clinical trial (gruppe), the laboratory parameter (parameter), the actual value of the parameter (messwert) and the unit of the value (einheit).

    A tibble: 6 x 6
    sid zeitpunkt gruppe parameter                      messwert einheit
  <dbl> <chr>     <chr>  <chr>                             <dbl> <chr>  
1     1 t0        DGE    ACPA(citrull. Prot.-Ak) EIA/Se  1000    U/ml   
2     1 t3        DGE    ACPA(citrull. Prot.-Ak) EIA/Se  1000    U/ml   
3     1 t2        DGE    Apolipoprot. A1 HP                 1.31 g/l    
4     1 t2        DGE    Apolipoprotein B                   0.61 g/l    
5     1 t0        DGE    aPTT Pathromtin SL                34.3  sek.   
6     1 t1        DGE    aPTT Pathromtin SL                37.3  sek.   
> 

I would now like to calculate the percent change of each parameter from baseline (t0) to the last timepoint (t3) for each individual, so that I can later compare the means of the two interventions. The problem: some values are missing, and some may exist twice with slightly different values, in case the laboratory measured twice or samples were missing. Also, some parameters were only assessed at baseline.

I tried to calculate the percent change by summarizing this new variable into a new dataframe:

    labor_change <-
      labor_long %>%
      group_by(sid, gruppe, parameter, zeitpunkt) %>%
      arrange(sid, parameter, zeitpunkt)%>%
      summarize(messwert=mean(messwert))%>%
      ungroup()%>%
      group_by(sid, gruppe, parameter) %>%
      summarize(changet3t0 =
                  (messwert[zeitpunkt == "t3"]-messwert[zeitpunkt = "t0"]/messwert[zeitpunkt == "t0"])*100)

My problem now: as soon as I use values from two different timepoints in the code (i.e. 2 different rows that differ in a second variable, zeitpunkt), R returns the desired dataframe, but filled with NA only:

    # Groups:   sid, gruppe, parameter [6]
    sid gruppe parameter             changet3t0
  <dbl> <chr>  <chr>                      <dbl>
1     1 DGE    aPTT Pathromtin SL            NA
2     1 DGE    Basophile %                   NA
3     1 DGE    Basophile absolut             NA
4     1 DGE    Calcium                       NA
5     1 DGE    Creatinkinase (CK) HP         NA
6     1 DGE    CRP HP                        NA
> 

As soon as I eliminate one timepoint out of the calculation, R gives me the desired calculated value. Any idea how I can fix this?


Solution

  • As mentioned by @Martin Gal, in one place you are using = instead of ==. In addition, you are currently computing (x - y/x), but what you want is (x - y)/x, which is different.

    I would also suggest using match() instead of == for the lookup: match() returns NA if there is no "t3" or "t0" in a group, so the result for that group is a single NA, whereas == would select nothing and return an empty vector.

    library(dplyr)
    
    labor_change <- labor_long %>%
      # Average duplicate measurements per subject, parameter and timepoint
      group_by(sid, gruppe, parameter, zeitpunkt) %>%
      arrange(sid, parameter, zeitpunkt) %>%
      summarize(messwert = mean(messwert), .groups = "drop") %>%
      # Percent change from t0 to t3; NA if either timepoint is missing
      group_by(sid, gruppe, parameter) %>%
      summarize(changet3t0 = (messwert[match("t3", zeitpunkt)] - messwert[match("t0", zeitpunkt)]) /
                  messwert[match("t0", zeitpunkt)] * 100, .groups = "drop")
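To see why the original code produced NA everywhere, here is a minimal sketch on made-up data. The tibble `toy` and the vectors `x` and `zp` are hypothetical and not taken from the question. As far as I understand it, `[` ignores argument names, so `messwert[zeitpunkt = "t0"]` is treated like `messwert["t0"]`, a lookup by element name. Since the vector has no names, that yields NA, and the NA then propagates through the arithmetic.

```r
library(dplyr)

# Hypothetical stand-ins for messwert and zeitpunkt within one group
x  <- c(50, 75)
zp <- c("t0", "t3")

# `[` ignores the argument name, so this is x["t0"]: a lookup by element
# name on an unnamed vector, which gives NA
bug <- x[zp = "t0"]

# The logical comparison selects the intended element
ok <- x[zp == "t0"]   # 50

# Missing timepoint: `==` selects nothing, match() gives exactly one NA
empty  <- x[zp == "t9"]        # numeric(0)
na_hit <- x[match("t9", zp)]   # NA

# Hypothetical toy data: parameter "A" has t0 and t3, "B" only has t0
toy <- tibble(
  sid       = c(1, 1, 1),
  parameter = c("A", "A", "B"),
  zeitpunkt = c("t0", "t3", "t0"),
  messwert  = c(50, 75, 10)
)

change <- toy %>%
  group_by(sid, parameter) %>%
  summarize(
    changet3t0 = (messwert[match("t3", zeitpunkt)] - messwert[match("t0", zeitpunkt)]) /
      messwert[match("t0", zeitpunkt)] * 100,
    .groups = "drop"
  )

print(change)   # "A" changes by 50 percent, "B" is NA because t3 is missing
```

With `==` in place of `match()`, the expression for parameter "B" would have length zero. Depending on the dplyr version, that group may then be dropped or trigger a warning instead of appearing as an NA row.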