I have a large dataframe that I need to filter and calculate some summary statistics, the df looks similar to this:
ID<-c("KMT1","KMT1","KMT1","KMT2","KMT2","KMT2","KMT3","KMT3","KMT3")
Date<-c("01-03-2015","01-03-2015","01-03-2015",
"04-06-2014","04-06-2014","04-06-2014",
"07-01-2019","07-01-2019","07-01-2019")
TimeUTC<-c("10:22:05","10:22:05","10:22:05",
"10:25:05","10:25:05","10:25:05",
"10:23:05","10:23:05","10:23:05")
V1<-c(0.01,0.003,0.04,0.03,0.02,0.05,0.03,0.1,0.02)
V2<-c(0.02,0.002,0.02,0.003,0.002,0.09,0.01,0.05,0.023)
V3<-c(0.04,0.008,0.06,0.09,0.004,0.05,0.01,0.003,0.04)
V4<-c(0.08,0.009,0.08,0.09,0.004,0.05,0.05,0.03,0.1)
Flag1<-c(0,0,0,1,0,1,0,0,1)
Flag2<-c(0,0,0,0,0,1,1,0,0)
Flag3<-c(0,0,0,0,0,0,1,1,0)
df1<-data.frame(ID,Date,TimeUTC,V1,V2,V3,V4,Flag1,Flag2,Flag3)
df1
ID Date TimeUTC V1 V2 V3 V4 Flag1 Flag2 Flag3
1 KMT1 01-03-2015 10:22:05 0.010 0.020 0.040 0.080 0 0 0
2 KMT1 01-03-2015 10:22:05 0.003 0.002 0.008 0.009 0 0 0
3 KMT1 01-03-2015 10:22:05 0.040 0.020 0.060 0.080 0 0 0
4 KMT2 04-06-2014 10:25:05 0.030 0.003 0.090 0.090 1 0 0
5 KMT2 04-06-2014 10:25:05 0.020 0.002 0.004 0.004 0 0 0
6 KMT2 04-06-2014 10:25:05 0.050 0.090 0.050 0.050 1 1 0
7 KMT3 07-01-2019 10:23:05 0.030 0.010 0.010 0.050 0 1 1
8 KMT3 07-01-2019 10:23:05 0.100 0.050 0.003 0.030 0 0 1
9 KMT3 07-01-2019 10:23:05 0.020 0.023 0.040 0.100 1 0 0
I would like to be able to filter the df based on the ID column so that I can assess if and how many flags active on the ID (0/1) and if so, which flags. I then need to remove any ID group if a flag is active or remove 1 of the entries if the flag is not important e.g., only flag3 is active.
After filtering the data I would like to calculate the median of each ID, so that the df would have 2 rows. The desired output would look something like this:
(A) Removed ID(KMT3) as flag 3 was active
ID Date TimeUTC V1.med V2.med V3.med V4.med
1 KMT1 01-03-2015 10:22:05 0.010 0.020 0.040 0.080
2 KMT2 04-06-2014 10:25:05 0.025 0.025 0.047 0.047
(B) Removed ID(KMT2 and KMT3) as flags are active
ID Date TimeUTC V1.med V2.med V3.med V4.med
1 KMT1 01-03-2015 10:22:05 0.01 0.02 0.04 0.08
I am new to R and not sure on the best way to approach this problem, I have tried using filter() from dplyr package which can remove all rows which contain an active flag (e.g. Flag1 = 1), however, I need to assess which flags are active and how any before I can remove the ID.
I have managed to calculate the median of each column using:
DT1<-df2 %>%
select("ID",V1:V4)%>%
group_by(ID)%>%
data.table()%>%
na.omit()
setnames(DT1[, sapply(.SD, function(x) list(median(x))), by=ID], c("ID", sapply(names(DT1)[-1], paste0, c(".median"))))
However, doing this I lose the Date and Time columns which are important for further processes.
Any help would be greatly appreciated.
Here's some code that gets you 1 row per ID, including the total number of flags, the number of flags of each type, and the median of the V*
columns:
library(dplyr)
df1 |>
summarize(
across(starts_with("Flag"), sum, .names = "n_{.col}"),
across(V1:V4, median, .names = "med_{.col}"),
.by = c(ID, Date, TimeUTC)
) |>
mutate(
n_total_flags = rowSums(across(starts_with("n_"))),
)
# ID Date TimeUTC n_Flag1 n_Flag2 n_Flag3 med_V1 med_V2 med_V3 med_V4 n_total_flags
# 1 KMT1 01-03-2015 10:22:05 0 0 0 0.01 0.020 0.04 0.08 0
# 2 KMT2 04-06-2014 10:25:05 2 1 0 0.03 0.003 0.05 0.05 3
# 3 KMT3 07-01-2019 10:23:05 1 1 2 0.03 0.023 0.01 0.05 4
You could use this result to do whatever logic you need to and join back to the original data to do filtering.