Search code examples
rdplyrgroup-bycumsum

Cumsum on investor and asset combination with a condition to restart in R


I have this dataframe

df <- structure(list(inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2" "INV_2"), 
ass = c("x", "x", "x", "y" "y", "x", "x", "x", "t", "t", "t"), 
datetime = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-08", "2010-01-19", "2010-02-20", "2010-02-22", "2010-02-23", "2010-03-01", "2010-03-02", "2010-03-04"), 
portfolio = c(10, 0, 2, 2, 0, 5, 5, 5, 3, 0, 2), 
G = (1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1), 
class = "data.frame", row.names = c(NA, -5L))

which represents investor transactions in financial markets, so I have 4k different investors IDs and 6k different assets. What I'm searching is a way to cumsum the variable G for each investor*asset combination. In particular I want that the cumsum() restart whenever that specific investor*asset combination is paired with a portfolio == 0.

So in the dataframe above I should get a new column called posdays which should be equal to:

posdays = (1, 1, 0, 0, 0, 1, 2, 3, 1, 1, 1)

where the first 3 entries refers to INV_1*X (notice the count restart in the third row because in the previous the portfolio == 0), the fourth and fifth to INV_1*Y then INV_2*X which cumsum the G variable for 3 times since the portfolio > 0, and the last three refers to INV_2*T where again the count restarts after the second entry since the portfolio == 0

I've tried something myself but I wasn't able to get what I'm looking for. My code is:

res <- res %>%
  group_by(group = cumsum(dplyr::lag(portfolio == 0, default = 0))) %>%
  mutate(posdays = cumsum(G)) %>%
  select(-group) %>% 
  ungroup

but in this way I'm not able to differenciate for investor and asset as I want. So basically I think I'm looking for a way to add a specification of investor*asset group_by in the previous code. But I have no idea of how since I have a low experience as an R user

Any idea?


Solution

  • Some minor corrections to your original dataframe:

    df <- structure(
      list(
        inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2"),
        ass = c("x", "x", "x", "y", "y", "x", "x", "x", "t", "t", "t"),
        datetime = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-08", "2010-01-19", "2010-02-20", "2010-02-22", "2010-02-23", "2010-03-01", "2010-03-02", "2010-03-04"),
        G = c(1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
        portfolio = c(10, 0, 2, 2, 0, 5, 5, 5, 3, 0, 2)
      ),
      class = "data.frame", row.names = c(NA, -11L)
    )
    

    It looks like you've got the right idea with your own code. The trick is to create and group by a new column. Don't forget to ensure your data are correctly ordered before performing cumsum.

    library(dplyr)
    
    df_new <- df |> 
      arrange(inv, ass, datetime) |> 
      group_by(inv, ass) |> 
      mutate(
        restart = lag(portfolio == 0, default = FALSE),
        group = cumsum(restart)
      ) |> 
      group_by(inv, ass, group) |> 
      mutate(pos_days = cumsum(G))