I have a dataframe with 37 columns in it, with a representative sample, df
below
df <- structure(list(irm = 201201:201202, trans11 = c(379L, 433L),
trans12 = 4:3, trans13 = 5:4, trans14 = c(13L, 3L), trans15 = c(29L,
21L), trans16 = c(0L, 0L), trans21 = c(6L, 4L), trans22 = 2:1,
trans23 = c(0L, 0L), trans24 = 0:1, trans25 = c(0L, 0L),
trans26 = c(0L, 0L), trans31 = c(2L, 2L), trans32 = c(0L,
0L), trans33 = 5:6, trans34 = c(0L, 0L), trans35 = c(7L,
2L), trans36 = c(0L, 0L), trans41 = c(4L, 10L), trans42 = c(0L,
0L), trans43 = c(0L, 0L), trans44 = c(4L, 10L), trans45 = c(3L,
1L), trans46 = c(0L, 0L), trans51 = c(15L, 18L), trans52 = c(0L,
0L), trans53 = c(1L, 1L), trans54 = c(4L, 0L), trans55 = c(96L,
115L), trans56 = c(0L, 0L), trans61 = c(0L, 0L), trans62 = c(0L,
0L), trans63 = c(0L, 0L), trans64 = c(0L, 0L), trans65 = c(0L,
0L), trans66 = c(0L, 0L)), row.names = c(NA, -2L), class = c("data.table",
"data.frame"))
The dataframe has 37 columns: trans11
...trans16
...trans61
...trans66
plus the irm
column for month.
What I would like to do is the following:
for each row/column entry in trans11
all the way through trans66
, calculate the proportion of the entry relative to the sum of all other columns with the same prefix (e.g. trans1
). So for the example here, the first row in entries 2 through 7 would be: (0.8813953, 0.009302326, 0.01162791, 0.03023256, 0.06744186, 0)
for columns 2 through 7 (since we need to sum trans11
....trans16
)
How would I do this for all 36 columns in the larger df?
Is there a way to do this with group_by
and starts_with
from dplyr
? I know a for
loop is probably possible but any and all suggestions are welcome.
Thanks
Updated example with larger dataframe example
in base R use prop.table/propotions
:
cbind(df, prop = prop.table(as.matrix(df[-1]), 1))
irm trans11 trans12 trans13 trans14 trans15 trans16 prop.trans11 prop.trans12 prop.trans13 prop.trans14 prop.trans15 prop.trans16
1 201201 379 4 5 13 29 0 0.8813953 0.009302326 0.011627907 0.030232558 0.06744186 0
2 201202 433 3 4 3 21 0 0.9331897 0.006465517 0.008620690 0.006465517 0.04525862 0
3 201203 468 2 9 9 19 0 0.9230769 0.003944773 0.017751479 0.017751479 0.03747535 0
4 201204 514 3 9 8 15 0 0.9362477 0.005464481 0.016393443 0.014571949 0.02732240 0
5 201205 559 11 8 6 12 0 0.9379195 0.018456376 0.013422819 0.010067114 0.02013423 0
6 201206 566 5 9 6 16 0 0.9401993 0.008305648 0.014950166 0.009966777 0.02657807 0
7 201207 577 3 13 4 21 0 0.9336570 0.004854369 0.021035599 0.006472492 0.03398058 0
8 201208 622 7 14 10 15 0 0.9311377 0.010479042 0.020958084 0.014970060 0.02245509 0
9 201209 665 4 12 3 11 0 0.9568345 0.005755396 0.017266187 0.004316547 0.01582734 0
10 201210 692 4 11 7 15 0 0.9492455 0.005486968 0.015089163 0.009602195 0.02057613 0
11 201211 738 6 5 5 9 0 0.9672346 0.007863696 0.006553080 0.006553080 0.01179554 0
12 201212 830 8 8 4 11 0 0.9639954 0.009291521 0.009291521 0.004645761 0.01277584 0
If you need to select using trans
name:
cbind(df, prop = prop.table(as.matrix(df[startsWith(names(df), 'trans')]), 1))
With the edit of the data, use:
library(tidyverse)
df %>%
pivot_longer(starts_with('trans'), names_to = c('tr', 'grp'),
names_pattern = '(.*)(.)') %>%
mutate(value = value/sum(value), .by = c(irm, tr)) %>%
pivot_wider(names_from = c(tr, grp), names_sep = '', names_prefix = 'prop_')