Search code examples
rcalculated-columns

How to sum values in multiple rows to a new column in R?


My dataframe:

structure(list(Observation = c("Apple", "Blueberry", "Cirtus", 
"Dates", "Eggplant"), Topic = 1:5, Gamma = c(0.1, 0.1, 0.2, 0.2, 
0.1)), class = "data.frame", row.names = c(NA, -5L))

  Observation Topic Gamma
1       Apple     1   0.1
2   Blueberry     2   0.1
3      Cirtus     3   0.2
4       Dates     4   0.2
5    Eggplant     5   0.1

How can I tell R to add the values of 1, 3, and 5, and 2 and 4, and then save it in a new column? For example:

Observation Topic Gamma new variable
Apple 1 .10 .40
Blueberry 2 .10 .30
Cirtus 3 .20 .40
Dates 4 .20 .30
Eggplant 5 .10 .40

Essentially, I'd like each observation to have a new value that sums up the gamma scores of topics 1, 3, and 5, as well as topics 2 and 4.

Update: Clarification: I am not trying to add even topic numbers or odd topic numbers. Sometimes it will be a mixture of both. See this new table as an example:

Observation Topic Gamma new variable
Apple 1 .10 .10
Blueberry 2 .10 .70
Cirtus 3 .20 .40
Dates 4 .20 .40
Eggplant 5 .10 .70
Fruits 6 .50 .70

In this example, I left topic 1 alone, added topics 2, 5, and 6, and added topics 3 and 4.

Update: Clarification:

Observation Topic Gamma new variable
Apple 1 .10 .10
Apple 2 .10 .70
Apple 3 .20 .40
Apple 4 .20 .40
Apple 5 .10 .70
Apple 6 .50 .70
Blueberry 1 .20 .20
Blueberry 2 .10 .60
Blueberry 3 .30 .80
Blueberry 4 .50 .80
Blueberry 5 .40 .60
Blueberry 6 .10 .60

In this example, Each fruit (observation) has their own set of values for each topic and I summed the same topics as listed above (2, 5, and 6, 3 and 4) per fruit.


Solution

  • Update II on new request:

    library(dplyr)
    
    df %>% 
      group_by(Observation, grp = case_when(Topic %in% 1 ~ 1,
                               Topic %in% c(2,5,6) ~ 2,
                               Topic %in% c(3,4) ~ 3)) %>% 
      mutate(new_variable = sum(Gamma)) %>% 
      ungroup %>% 
      select(-grp)
    
      Observation Topic Gamma new_variable
       <chr>       <int> <dbl>        <dbl>
     1 Apple           1   0.1          0.1
     2 Apple           2   0.1          0.7
     3 Apple           3   0.2          0.4
     4 Apple           4   0.2          0.4
     5 Apple           5   0.1          0.7
     6 Apple           6   0.5          0.7
     7 Blueberry       1   0.2          0.2
     8 Blueberry       2   0.1          0.6
     9 Blueberry       3   0.3          0.8
    10 Blueberry       4   0.5          0.8
    11 Blueberry       5   0.4          0.6
    12 Blueberry       6   0.1          0.6
    

    Update: on new request of OP. This solution is inspired fully by PaulS solution (credits to him):

    library(dplyr)
    
    df %>% 
      group_by(grp = case_when(Topic %in% 1 ~ 1,
                               Topic %in% c(2,5,6) ~ 2,
                               Topic %in% c(3,4) ~ 3)) %>% 
      mutate(new_variable = sum(Gamma)) %>% 
      ungroup %>% 
      select(-grp)
    
      Observation Topic Gamma new_variable
      <chr>       <int> <dbl>        <dbl>
    1 Apple           1   0.1          0.1
    2 Blueberry       2   0.1          0.7
    3 Cirtus          3   0.2          0.4
    4 Dates           4   0.2          0.4
    5 Eggplant        5   0.1          0.7
    6 Fruits          6   0.5          0.7
    

    First answer: We could sum Gamma after identifying odd and even rows in an ifelse statement: In this case row_number could be replaced by Topic

    library(dplyr)
    
    df %>% 
      mutate(new_variable = ifelse(row_number() %% 2 == 1, 
                                   sum(Gamma[row_number() %% 2 == 1]), # odd 1,3,5
                                   sum(Gamma[row_number() %% 2 == 0])) # even 2,4
             )
    
      Observation Topic Gamma new_variable
    1       Apple     1   0.1          0.4
    2   Blueberry     2   0.1          0.3
    3      Cirtus     3   0.2          0.4
    4       Dates     4   0.2          0.3
    5    Eggplant     5   0.1          0.4
    

    data:

    structure(list(Observation = c("Apple", "Blueberry", "Cirtus", 
    "Dates", "Eggplant"), Topic = 1:5, Gamma = c(0.1, 0.1, 0.2, 0.2, 
    0.1)), class = "data.frame", row.names = c(NA, -5L))
    

    Microbenchmark: AndrewGB's base R is fastest

    enter image description here