Search code examples
rdataframedplyrsummarize

summarize from two differents rows


This is my starting df

test <- data.frame(year = c(2018,2018,2018,2018,2018), 
                    source = c("file1", "file1", "file1", "file1", "file1"),
                    area = c("000", "000", "800", "800", "800"),
                    cult2 = c("PBGEX", "QPGEX", "PBGEX", "QPGEX", "QPIND"), 
                    value = c(1000,2000,3000,4000,5000))

  year source area cult2 value
1 2018  file1  000 PBGEX  1000
2 2018  file1  000 QPGEX  2000
3 2018  file1  800 PBGEX  3000
4 2018  file1  800 QPGEX  4000
5 2018  file1  800 QPIND  5000

I need to get for each year/source/area the sum of value, for the fields PBGEX and QPGEX. I was thinking of using spread and gather but I m losing many others columns (not show here).

This what I would except :

  year source area cult2 value
1 2018  file1  000 PBGEX  1000
2 2018  file1  000 QPGEX  2000
3 2018  file1  800 PBGEX  3000
4 2018  file1  800 QPGEX  4000
5 2018  file1  800 QPIND  5000
6 2018  file1  000 RDGEX  3000
7 2018  file1  800 RDGEX  7000

Solution

  • We can filter the rows where the 'cult2' are 'QPGEX', 'PBGEX', then do a group_by sum and bind_rows with original dataset

    library(dplyr)
    test %>%
        filter(cult2 %in% c("QPGEX", "PBGEX")) %>% 
        group_by(year, source, area) %>%
        summarise(cult2 = "RDGEX", value = sum(value), .groups = 'drop') %>%
        bind_rows(test, .)
    

    -output

    #   year source area cult2 value
    #1 2018  file1  000 PBGEX  1000
    #2 2018  file1  000 QPGEX  2000
    #3 2018  file1  800 PBGEX  3000
    #4 2018  file1  800 QPGEX  4000
    #5 2018  file1  800 QPIND  5000
    #6 2018  file1  000 RDGEX  3000
    #7 2018  file1  800 RDGEX  7000
    

    If we need a proportion column

    test %>%
     filter(cult2 %in% c("QPGEX", "PBGEX")) %>% 
     group_by(year, source, area) %>%
     group_by(prop = value[cult2== "QPGEX"]/value[cult2 == "PBGEX"],
            .add = TRUE) %>% 
     summarise(cult2 = "RDGEX", value = sum(value), .groups = 'drop') %>% 
     bind_rows(test, .)
    

    Or it can be also

    library(tidyr)
    test %>% 
       filter(cult2 %in% c("QPGEX", "PBGEX")) %>%
       pivot_wider(names_from = cult2, values_from = value) %>% 
       # or use spread
       #spread(cult2, value) %>%
       mutate(prop = QPGEX/PBGEX) %>% 
       select(-PBGEX, -QPGEX) %>%
       right_join(test)
    

    -output

    # A tibble: 5 x 6
    #   year source area   prop cult2 value
    #  <dbl> <chr>  <chr> <dbl> <chr> <dbl>
    #1  2018 file1  000    2    PBGEX  1000
    #2  2018 file1  000    2    QPGEX  2000
    #3  2018 file1  800    1.33 PBGEX  3000
    #4  2018 file1  800    1.33 QPGEX  4000
    #5  2018 file1  800    1.33 QPIND  5000