Tags: r, dataframe, data-manipulation, outliers, bigdata

Calculating outliers within specific niches of a dataframe? [Complex]


I have a fairly big problem that I would really appreciate some help with. Essentially, I have a large dataframe that looks like this. Please note that all of this R code is run in the terminal, not RStudio!

![Dataframe](https://i.sstatic.net/OkmfC.jpg)

What I'm trying to do is separate the dataframe by the unique val_lvl2 treatments.

Here is code that does exactly what I want, but I need to do it on a much larger scale.

Function code:

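remove_outliers <- function(x, na.rm = TRUE, ...) {
  # First and third quartiles of x
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  # Fence width: 1.5 times the interquartile range
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  # Replace values outside the fences with NA
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}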

CODE:

holder1 <- subset(z_combined_cost_dtrmnt, val_lvl2 ==  "Hammer Toe Repair")

holder1 <- holder1[!(holder1$episode_count <=3),]

holder1$prd_num_of_days_num <- remove_outliers(holder1$prd_num_of_days_num)

This removes all of the outlier lengths for Hammer Toe Repair in val_lvl2, which is exactly what I want. However, I don't want to repeat this step by hand for every treatment, since there are quite a few unique treatments! After removing all the outliers, I also need to remove the resulting NA values and merge all the data back into the one dataframe "z_combined_cost_dtrmnt", which should then have outlier lengths removed separately for each unique treatment in val_lvl2. The code above is as far as I've gotten with removing the outliers, so help would be greatly appreciated. I am positive there is a more efficient way to do this than writing out this code for each treatment!

Here is every unique treatment in val_lvl2:

![Unique values](https://i.sstatic.net/ky68G.jpg)


Solution

  • You can use split to create a list of data frames by level of val_lvl2...

    holders <- split(z_combined_cost_dtrmnt, z_combined_cost_dtrmnt$val_lvl2)
    

    And then apply whatever functions you want to each element of that list using lapply, e.g.

    # Drop rows with episode_count <= 3 in every group
    holders <- lapply(holders, function(x) x[!(x$episode_count <= 3), ])
    # Set outlier lengths to NA within each group
    holders <- lapply(holders, function(x) {
      x$prd_num_of_days_num <- remove_outliers(x$prd_num_of_days_num)
      x
    })
    

    You will end up with a list of dataframes, one for each level of val_lvl2.
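
To get from that list back to a single data frame, as the question asks, one possible sketch is to bind the pieces together with do.call(rbind, ...) and then drop the rows whose length was set to NA. This is a sketch under assumptions, not a tested solution for the real data: the toy data frame and the "Bunionectomy" treatment are made up for illustration, "remove the NA" is assumed to mean dropping whole rows, and rows whose prd_num_of_days_num was already NA before the outlier step would be dropped as well.

```r
# Outlier helper from the question: values beyond 1.5 * IQR from the
# quartiles are replaced with NA.
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Made-up stand-in for z_combined_cost_dtrmnt (illustrative values only;
# "Bunionectomy" is a hypothetical second treatment).
z_combined_cost_dtrmnt <- data.frame(
  val_lvl2 = rep(c("Hammer Toe Repair", "Bunionectomy"), each = 6),
  episode_count = c(5, 6, 7, 8, 9, 2, 4, 5, 6, 7, 8, 9),
  prd_num_of_days_num = c(10, 11, 12, 13, 200, 12, 30, 31, 32, 33, 34, 500),
  stringsAsFactors = FALSE
)

# One data frame per treatment
holders <- split(z_combined_cost_dtrmnt, z_combined_cost_dtrmnt$val_lvl2)

# Filter and flag outliers separately within each treatment
holders <- lapply(holders, function(x) {
  x <- x[!(x$episode_count <= 3), ]
  x$prd_num_of_days_num <- remove_outliers(x$prd_num_of_days_num)
  x
})

# Stack the per-treatment data frames back into one
cleaned <- do.call(rbind, holders)

# Drop rows whose length is NA (assumed meaning of "remove the NA")
cleaned <- cleaned[!is.na(cleaned$prd_num_of_days_num), ]

# rbind on a named list prefixes row names with the group name; reset them
rownames(cleaned) <- NULL
```

Assigning cleaned back to z_combined_cost_dtrmnt would overwrite the original, so it may be safer to keep it under a separate name until the result has been checked.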