I'm trying to produce a weighted sum per factor level. I have four columns of data:
col1 = surface area
col 2 = dominant
col 3 = codominant
col 4 = sub
1 2 3 4
125 A NA NA
130 A NA B
150 C B NA
160 B NA NA
90 B A NA
180 C A B
So, for the above example output, my new dataframe would be:
1 2
A 326.9
B 331.4
C 134.4
I fiddled around with ifelse
and came op with something like (for two columns for this example):
df1 <- df %>%
mutate(weighted_dominant = ifelse(!is.na(dominant) & is.na(codominant), Surface_Area,
Surface_Area/2),
weighted_codominant = ifelse(!is.na(codominant), Surface_Area/2, NA )
Now i isolate the columns of intereset:
df2 <- df1 %>% select(dominant, weighted_dominant) %>%
group by (dominant) %>%
summarise (sum = sum(weighted_dominant)
also perform this for the codominant column, bind the rows of the two new dataframes and do the summarise function again.
This gets the job done, but also takes like 50 lines of code and is, in my opinion, not very clean.
My question: Are there better (tidyverse) ways to do this kind of weighted summarisation?
With tidyverse
you could consider the following approach.
Include row numbers as a separate column, so you can make evaluations within each row. The pivot_longer
will put your data into long format.
After grouping by row number, you can determine values for A, B, and C depending on which columns are missing. This assumes there is always a "dominant" column (otherwise, you could adjust the logic here).
Then, remove your NA
, and total up the weighted values for A, B, and C.
df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = c(dominant, codominant, sub)) %>%
group_by(rn) %>%
mutate(weight = case_when(
is.na(value[name == "codominant"]) & is.na(value[name == "sub"]) ~ as.numeric(Surface_Area),
is.na(value[name == "codominant"]) & name == "dominant" ~ Surface_Area * .75,
is.na(value[name == "codominant"]) & name == "sub" ~ Surface_Area * .25,
is.na(value[name == "sub"]) ~ Surface_Area / 2,
TRUE ~ Surface_Area / 3
)) %>%
drop_na() %>%
group_by(value) %>%
summarise(total = sum(weight))
Output
value total
<chr> <dbl>
1 A 328.
2 B 372.
3 C 135