Search code examples
rdplyrdata-manipulationdata-wranglingsummarize

Manipulate dataframe and summarise


I have a large dataframe that I need to filter and calculate some summary statistics, the df looks similar to this:

ID<-c("KMT1","KMT1","KMT1","KMT2","KMT2","KMT2","KMT3","KMT3","KMT3")

Date<-c("01-03-2015","01-03-2015","01-03-2015",
        "04-06-2014","04-06-2014","04-06-2014",
        "07-01-2019","07-01-2019","07-01-2019")
TimeUTC<-c("10:22:05","10:22:05","10:22:05",
           "10:25:05","10:25:05","10:25:05",
           "10:23:05","10:23:05","10:23:05")

V1<-c(0.01,0.003,0.04,0.03,0.02,0.05,0.03,0.1,0.02)
V2<-c(0.02,0.002,0.02,0.003,0.002,0.09,0.01,0.05,0.023)
V3<-c(0.04,0.008,0.06,0.09,0.004,0.05,0.01,0.003,0.04)
V4<-c(0.08,0.009,0.08,0.09,0.004,0.05,0.05,0.03,0.1)

Flag1<-c(0,0,0,1,0,1,0,0,1)
Flag2<-c(0,0,0,0,0,1,1,0,0)
Flag3<-c(0,0,0,0,0,0,1,1,0)

df1<-data.frame(ID,Date,TimeUTC,V1,V2,V3,V4,Flag1,Flag2,Flag3)

df1
    ID       Date  TimeUTC    V1    V2    V3    V4 Flag1 Flag2 Flag3
1 KMT1 01-03-2015 10:22:05 0.010 0.020 0.040 0.080     0     0     0
2 KMT1 01-03-2015 10:22:05 0.003 0.002 0.008 0.009     0     0     0
3 KMT1 01-03-2015 10:22:05 0.040 0.020 0.060 0.080     0     0     0
4 KMT2 04-06-2014 10:25:05 0.030 0.003 0.090 0.090     1     0     0
5 KMT2 04-06-2014 10:25:05 0.020 0.002 0.004 0.004     0     0     0
6 KMT2 04-06-2014 10:25:05 0.050 0.090 0.050 0.050     1     1     0
7 KMT3 07-01-2019 10:23:05 0.030 0.010 0.010 0.050     0     1     1
8 KMT3 07-01-2019 10:23:05 0.100 0.050 0.003 0.030     0     0     1
9 KMT3 07-01-2019 10:23:05 0.020 0.023 0.040 0.100     1     0     0

I would like to be able to filter the df based on the ID column so that I can assess if and how many flags active on the ID (0/1) and if so, which flags. I then need to remove any ID group if a flag is active or remove 1 of the entries if the flag is not important e.g., only flag3 is active.

After filtering the data I would like to calculate the median of each ID, so that the df would have 2 rows. The desired output would look something like this:

(A) Removed ID(KMT3) as flag 3 was active

 ID       Date  TimeUTC V1.med V2.med V3.med V4.med
1 KMT1 01-03-2015 10:22:05  0.010  0.020  0.040  0.080
2 KMT2 04-06-2014 10:25:05  0.025  0.025  0.047  0.047

(B) Removed ID(KMT2 and KMT3) as flags are active

    ID       Date  TimeUTC V1.med V2.med V3.med V4.med
1 KMT1 01-03-2015 10:22:05   0.01   0.02   0.04   0.08

I am new to R and not sure on the best way to approach this problem, I have tried using filter() from dplyr package which can remove all rows which contain an active flag (e.g. Flag1 = 1), however, I need to assess which flags are active and how any before I can remove the ID.

I have managed to calculate the median of each column using:

DT1<-df2 %>%
  select("ID",V1:V4)%>%
  group_by(ID)%>%
  data.table()%>%
  na.omit()

setnames(DT1[, sapply(.SD, function(x) list(median(x))), by=ID], c("ID", sapply(names(DT1)[-1], paste0, c(".median"))))

However, doing this I lose the Date and Time columns which are important for further processes.

Any help would be greatly appreciated.


Solution

  • Here's some code that gets you 1 row per ID, including the total number of flags, the number of flags of each type, and the median of the V* columns:

    library(dplyr)
    df1 |>
      summarize(
        across(starts_with("Flag"), sum, .names = "n_{.col}"),
        across(V1:V4, median, .names = "med_{.col}"),
        .by = c(ID, Date, TimeUTC)
      ) |>
      mutate(
        n_total_flags = rowSums(across(starts_with("n_"))),
      )
    #     ID       Date  TimeUTC n_Flag1 n_Flag2 n_Flag3 med_V1 med_V2 med_V3 med_V4 n_total_flags
    # 1 KMT1 01-03-2015 10:22:05       0       0       0   0.01  0.020   0.04   0.08             0
    # 2 KMT2 04-06-2014 10:25:05       2       1       0   0.03  0.003   0.05   0.05             3
    # 3 KMT3 07-01-2019 10:23:05       1       1       2   0.03  0.023   0.01   0.05             4
    

    You could use this result to do whatever logic you need to and join back to the original data to do filtering.