Search code examples
rtidyversetail

Clear the variables except first row in a group after separate row R


I have to achieve 3 things :

  1. Separate rows splitted by Col : "Date" contain different years.
  2. EXCEPT the Col :"Fix" , only keep FIRST row within a group has data.
  3. Calculate Col : "Price" in even divided by the the numbers of row splitted.

    # Dataset call df 
    
    Name Fix Price   Date 
    Tom  600  500     2016-01-01
    John 800  400     2016-01-03;2016-01-09
    Mary 1100 1000    2016-01-04;2017-09-01;2017-08-10;2018-01-01
    Joe  30   25      2016-01-04;2017-09-01
    Paul 400  304     2015-01-02
    Alex 505  238     2018-01-02;2018-02-02
    
    # Targeted df
    
    Name Fix  Price   Date 
    Tom  600  500     2016-01-01
    John 800  400     2016-01-03;2016-01-09
    Mary 1100 250     2016-01-04
    Mary 0    250     2017-09-01
    Mary 0    250     2017-08-10
    Mary 0    250     2018-01-01
    Joe  30   12.5    2016-01-04
    Joe  0    12.5    2017-09-01
    Paul 400  304     2015-01-02
    Alex 505  238     2018-01-02;2018-02-02
    

I have an approach to achieve 1 and 3 , but i cannot make 2 that i don't know any the Diverse function of fill().

# Find the SPLIT TARGET first :
inds <- sapply(strsplit(df$Date, ";"), function(x) 
length(unique(format(as.Date(x), "%Y"))) > 1) ### This approach actually 
does not works in my actual data when over 1 Million rows , i chunk it into a smaller data 
to fit this approach's limit.

library(tidyverse)
library(stringr)

#Select those indices 

df[inds, ] %>% mutate(Price = Price / (str_count(Date,";") + 1)) %>%
separate_rows(Date, sep = ";") %>%
bind_rows(df[!inds,])

*Please remind you cannot use Col : " Name " for something as they just represent that the dataset has other values those have to copy by separate_rows
How can i clear point 2 ? Its drive me crazy .Thanks for any comments in advances


Solution

  • dt = read.table(text = "
    Name Fix Price   Date 
                    Tom  600  500     2016-01-01
                    John 800  400     2016-01-03;2016-01-09
                    Mary 1100 1000    2016-01-04;2017-09-01;2017-08-10;2018-01-01
                    Joe  30   25      2016-01-04;2017-09-01
                    Paul 400  304     2015-01-02
                    Alex 505  238     2018-01-02;2018-02-02
                    ", header=T, stringsAsFactors=F)
    
    library(tidyverse)
    library(lubridate)
    
    dt %>%
      separate_rows(Date, sep=";") %>%                     # separate dates
      group_by(Name, year = year(ymd(Date))) %>%           # for each Name and year of the date
      summarise(Fix = unique(Fix),                         # keep Fix
                Price = unique(Price),                     # keep Price
                Date = paste0(Date, collapse = ";")) %>%   # combine dates with same year
      mutate(Fix = ifelse(row_number() > 1, 0, Fix),       # update Fix values
             Price = Price/length(Price)) %>%              # divide Price by number of rows
      ungroup()                                            # forget the grouping
    
    # # A tibble: 9 x 5
    #   Name   year   Fix Price Date                 
    #   <chr> <dbl> <dbl> <dbl> <chr>                
    # 1 Alex   2018   505 238   2018-01-02;2018-02-02
    # 2 Joe    2016    30  12.5 2016-01-04           
    # 3 Joe    2017     0  12.5 2017-09-01           
    # 4 John   2016   800 400   2016-01-03;2016-01-09
    # 5 Mary   2016  1100 333.  2016-01-04           
    # 6 Mary   2017     0 333.  2017-09-01;2017-08-10
    # 7 Mary   2018     0 333.  2018-01-01           
    # 8 Paul   2015   400 304   2015-01-02           
    # 9 Tom    2016   600 500   2016-01-01