Search code examples
rrowrename

Sum up rows by partial match of name and rename based on a name database


after a long time I am working again with R and thus I am a bit rusty. Every bit of help is appreciated.

So I have a survey out there which contains many matrix questions, even dual matrix. In this example it is a dual matirx is the survey question come formatted in Question.AnswerCode..FirstOrSecondMatrix. I made a frequency data frame who looks like this: Where I come from

subsetting from my original data for reproduction - columns in german, not sorted

dput(PIL)
structure(list(gering = c(12L, 9L, 12L, 3L, 12L, 8L, 3L, 10L, 
12L, 6L, 7L, 7L), hoch = c(32L, 45L, 28L, 43L, 39L, 44L, 20L, 
20L, 17L, 20L, 28L, 18L), `keine Angabe` = c(6L, 20L, 12L, 30L, 
9L, 24L, 9L, 16L, 10L, 17L, 10L, 17L), mittel = c(27L, 32L, 15L, 
27L, 23L, 27L, 19L, 18L, 20L, 25L, 10L, 24L), `sehr gering` = c(17L, 
10L, 28L, 1L, 14L, 2L, 1L, 1L, 3L, 2L, 15L, 4L), `sehr hoch` = c(22, 
0, 21, 12, 19, 11, 18, 5, 8, 0, 0, 0), Summe = c(116, 116, 116, 
116, 116, 116, 70, 70, 70, 70, 70, 70)), row.names = c("PIL.PILK1..1.", 
"PIL.PILK1..2.", "PIL.PILK2..1.", "PIL.PILK2..2.", "PIL.PILK3..1.", 
"PIL.PILK3..2.", "PIL2.PILK1..1.", "PIL2.PILK1..2.", "PIL2.PILK2..1.", 
"PIL2.PILK2..2.", "PIL2.PILK3..1.", "PIL2.PILK3..2."), class = "data.frame")

and as I result I try to achieve this: what I try to achive

I would like to join(sum) the rows by the middle part of the string "PILK1" = Prozess1 and rename it into "Prozess 1..1." or Prozess1..2.". Then renaming the last step "..1." with grepl to the desired output i am pretty sure to get this done since it has only two options. But all my searching has not helped me with the renaming a part of a string from a second data frame with the partial strings and the name of the process. The names list(data frame) contains 100+ names as abbrev strings and full names.

Where I am stuck

  • How to identify a row by partial match and sum those up
  • How to replace a string up to a certain point of a string using a list or data frame (picture 3)

Thanks a lot

PS: My workaround up to now is directly addressing it i.e. Prozess1_Intensity<- table(ifelse(!is.na(PIL.PILK1..1.),PIL.PILK1..1., PIL2.PILK1..1.), useNA = "always") and bind all items into a data frame. But this is a lot of copy and paste or in my case chaining everything together in Excel and paste it into R, which becomes tedious if applied to the whole survey


Solution

  • If the lookup table to match the process name with it's abbrevation is called as lookup.

    library(dplyr)
    library(tidyr)
    
    lookup <- data.frame(prozess_name = c('Prozess1', 'Prozess2', 'Prozess3'), 
                         abbrev = c('PILK1', 'PILK2', 'PILK3'))
    
    lookup
    
    #  prozess_name abbrev
    #1     Prozess1  PILK1
    #2     Prozess2  PILK2
    #3     Prozess3  PILK3
    

    You can divide the rownames in 3 separate columns using extract, join with lookup and sum the column values using across.

    PIL %>%
      rownames_to_column('abbrev') %>%
      extract(abbrev, c('PIL','abbrev', 'num'), '(PIL\\d?)\\.(PILK\\d+)\\.\\.(\\d+)\\.') %>%
      left_join(lookup, by = 'abbrev') %>%
      group_by(prozess_name, num) %>%
      summarise(across(gering:Summe, sum, na.rm = TRUE)) %>%
      ungroup
    
    #  prozess_name num   gering  hoch `keine Angabe` mittel `sehr gering` `sehr hoch` Summe
    #  <chr>        <chr>  <int> <int>          <int>  <int>         <int>       <dbl> <dbl>
    #1 Prozess1     1         15    52             15     46            18          40   186
    #2 Prozess1     2         19    65             36     50            11           5   186
    #3 Prozess2     1         24    45             22     35            31          29   186
    #4 Prozess2     2          9    63             47     52             3          12   186
    #5 Prozess3     1         19    67             19     33            29          19   186
    #6 Prozess3     2         15    62             41     51             6          11   186