Search code examples
rdataframedplyraveragedistinct-values

Calculation averages per patient using n_distinct with condition


In my dataframe I would like to calculate the absolute frequencies of performed healthcare activities, the relative frequency and the average number performed per patient.

I used the following code to calculate healthcare utilization:

Df %>%
   group_by(A) %>%
   summarize(n = n()) %>%
   mutate(rel.freq = (n/sum(n))*100) %>%
   mutate(avg.A.pt = n/sum(n_distinct(Person[A == A])))   

I have a problem with the last line of the code. I need to calculate the number of activities per patient for one specific type of care, calculated as total number of the activity n divided by the unique number of patients n_distinct(Person), but only divided by the patients that recived that specific type of care Person[HCU == HCU].

The results I am aiming for would look like this:

*HCU    n     rel.freq     avg.hcu.pt*
ECG   486      10%          4.0
Echo  301      8%           1.8

Could you help me fix the code?

Thank you in advance!


After a reply, some extra information:

I work with remote access in a secured environment, so unfortunately I'm not able to provide you with samples of the data. I have a dataset of approximately 20.000 patients who received 11.000.000 healthcare activities (rows) and 34 columns with for example the specialism, healthcare centre, age and a personal code. For my article I would like to show: - a percentage of (unique) patients who received a specific healthcare activity at least once (I called it the relative frequency) - Average number of the healthcare activity (of a specific type) per (unique) patient

Basically I have mapped the types of care, for example lab test using group_by and filter of dplyr, this gave me the total number of lab tests. But now I want to specify it, for example how many patients had at least one MRI, how many never had an MRI, how many MRI’s (on average) did the patients receive.

I tried your suggestion

Df %>%
Group_by(A, Person) %>%
Summarise(n = n())

# A= healthcare activities

Which gives me:

A            Person         n
MRI        1                 6
MRI        2                 2
… for all >1000 patients who received MRI
Echo      1                 3
And so on

How do I get the % of patients with MRI? And the average number of MRI’s per patient?


Solution

  • Lets create some toy data. Four treatments with different probabilities. 100 patients visit 1000 times.

    set.seed(123)
    df<-data.frame(A = sample(c("MRI", "ECG", "Echo", "PET"), 1000,
                              prob=c(0.05, 0.8, 0.13, 0.02), replace=TRUE),
                   p = sample(1:100, 1000, replace=TRUE))
    

    Now we aggregate the data

        df %>% 
      # group by Treatment and patients
      group_by(A, p) %>% 
      # first summary is the number of a specific treatments for each patient
      summarise(n = n()) %>% 
      # next summary we sum the number distinct patients in the group
      # and divide by sum the number of distinct patients to get the rel.freq of the treatment.
      # Then we take the mean value of the number of treatment pr. patient 
      summarise(rel.freq   = n_distinct(p)/n_distinct(df$p),
                avg.hcu.pt = mean(n))
    

    Result

    # A tibble: 4 x 3
    A     rel.freq avg.hcu.pt
    <fct>    <dbl>      <dbl>
    1 ECG       1          8.02
    2 Echo      0.76       1.72
    3 MRI       0.37       1.30
    4 PET       0.17       1.12