Search code examples
rdata-cleaning

Finding "outliers" in a group


I am working with hospital discharge data. All hospitalizations (cases) with the same Patient_ID are supposed to be of the same person. However I figured out that there are Pat_ID's with different ages and both sexes.

Imagine I have a data set like this:

Case_ID <- 1:8
Pat_ID <- c(rep("1",4), rep("2",3),"3")
Sex <- c(rep(1,4), rep(2,2),1,1)
Age <- c(rep(33,3),76,rep(19,2),49,15)
Pat_File <- data.frame(Case_ID, Pat_ID, Sex,Age)

Case_ID Pat_ID Sex Age
1       1      1   33
2       1      1   33
3       1      1   33
4       1      1   76
5       2      2   19
6       2      2   19
7       2      1   49
8       3      1   15 

It was relatively easy to identify Pat_ID's with cases that differ from each other. I found these ID's by calculating an average for age and/or sex (coded as 1 and 2) with help of the function aggregate and then calculated the difference between the average and age or sex. I would like to automatically remove/identify cases where age or sex deviate from the majority of the cases of a patient ID. In my example I would like to remove cases 4 and 7.


Solution

  • You could try

    library(data.table)
    

    Using Mode from Is there a built-in function for finding the mode?

      Mode <- function(x) {
        ux <- unique(x)
        ux[which.max(tabulate(match(x, ux)))]
     }
    
    setDT(Pat_File)[, .SD[Age==Mode(Age) & Sex==Mode(Sex)] , by=Pat_ID]
    #    Pat_ID Case_ID Sex Age
    #1:      1       1   1  33
    #2:      1       2   1  33
    #3:      1       3   1  33
    #4:      2       5   2  19
    #5:      2       6   2  19
    #6:      3       8   1  15
    

    Testing other cases,

     Pat_File$Sex[6] <- 1
     Pat_File$Age[4] <- 16
     setDT(Pat_File)[, .SD[Age==Mode(Age) & Sex==Mode(Sex)] , by=Pat_ID]
     #    Pat_ID Case_ID Sex Age
     #1:      1       1   1  33
     #2:      1       2   1  33
     #3:      1       3   1  33
     #4:      2       6   1  19
     #5:      3       8   1  15