Search code examples
rexceldata-manipulationlarge-data

Massive group variables: Move column data ONE ROW UP within the group EFFICIENTLY [Panel data]


short questions: I want to move the data in one column one row up by group: within the group, the first row is replaced by second row data, second replace by third etc. The last row of the new column is 0. I have 40,000 grouping levels and 230,000 row in total.

Long questions: I have a panel data regarding consumer purchase, multiple users purchase at multiple times. The group variable is the consumer with brand choice data. I want to replicate a second choice data to calculate the repurchase rate for a certain brand of the same user. To define the repurchase : the repurchase of the "1st purchase" is the "2nd purchase"; And the repurchase of the "2nd purchase" is the "3rd purchase" The last purchase has no repurchase.

My code takes me 8 minutes for this simple data manipulations in r, but only 1 second in Excel.

Data is like(sorted by day shopper first and day):

    Day Shopper Choice
    1   A   Coke
    2   A   Coke
    1   B   Sprite
    1   C   Coke
    2   C   Pepsi
    3   C   Coke
    1   D   Sprite
    2   D   Sprite

The desired output:

Day Shopper Choice  choice 2
 1  A   apple        *apple*
 2  A   apple           0
 1  B   Banana          0
 1  C   apple          Banana
 2  C   Banana         apple 
 3  C   apple           0
 1  D   berry           *berry*
 2  D   berry           0

My original code is

# sort the data by user first and then by day
# choice.2 is the new column name
n<-nrow(dt) 
for (i in 1:n) {
 if (df$shopper[i]==dt$shopper[i+1]) 
{choice.2[i]<-choice[i+1]}
 else {choice.2[i]<-0}}

Solution

  • If you mean that you want to group by shopper and then lead Choice, filling with zeros, with dplyr,

    library(dplyr)
    
    df %>% group_by(Shopper) %>% mutate(choice2 = lead(as.character(Choice), default = '0'))
    ## Source: local data frame [8 x 4]
    ## Groups: Shopper [4]
    ## 
    ##     Day Shopper Choice choice2
    ##   <int>  <fctr> <fctr>   <chr>
    ## 1     1       A  apple   apple
    ## 2     2       A  apple       0
    ## 3     1       B Banana       0
    ## 4     1       C  apple  Banana
    ## 5     2       C Banana   apple
    ## 6     3       C  apple       0
    ## 7     1       D  berry   berry
    ## 8     2       D  berry       0 
    

    or with data.table, something like

    library(data.table)
    
    setDT(df)[, choice2 := shift(as.character(Choice), type = 'lead', fill = '0') , by = Shopper][]
    ##    Day Shopper Choice choice2
    ## 1:   1       A  apple   apple
    ## 2:   2       A  apple       0
    ## 3:   1       B Banana       0
    ## 4:   1       C  apple  Banana
    ## 5:   2       C Banana   apple
    ## 6:   3       C  apple       0
    ## 7:   1       D  berry   berry
    ## 8:   2       D  berry       0
    

    or in base,

    df$choice2 <- ave(as.character(df$Choice), df$Shopper, FUN = function(x){c(x[-1], '0')})
    df
    ##   Day Shopper Choice choice2
    ## 1   1       A  apple   apple
    ## 2   2       A  apple       0
    ## 3   1       B Banana       0
    ## 4   1       C  apple  Banana
    ## 5   2       C Banana   apple
    ## 6   3       C  apple       0
    ## 7   1       D  berry   berry
    ## 8   2       D  berry       0
    

    If Choice is a factor, all versions coerce choice2 to character instead, which comes with some time penalties. If you add "0" to the factor levels, the same approaches should work without coercion.