Search code examples
rdataframedplyrdata-extraction

Using R to extract data from a dataframe and store data in unknown number of new columns


I have a dataframe as follows:

library(dplyr)
df <- data.frame(A=1:20, 
                  B=c(2,1.8,1.6,1.8,4,6,8,10,12,10,8,6,13,14,15,16,16.5,15,14,13))
mutate(df, C = B - lag(B))
A   B     C
1   2.0   NA
2   1.8 -0.2
3   1.6 -0.2
4   1.8  0.2
5   4.0  2.2
6   6.0  2.0
7   8.0  2.0
8   10.0  2.0
9   12.0  2.0
10  10.0 -2.0
11  8.0 -2.0
12  6.0 -2.0
13  13.0  7.0
14  14.0  1.0
15  15.0  1.0
16  16.0  1.0
17  16.5 -0.5
18  15.0 -1.0
19  14.0 -1.0
20  13.0 -1.0

And I want to extract out the sequences of minus values where there are 3 or more consecutively together and put in a separate column. So for example put the values from (col C)row 10, 11, 12 in a new column and row 17, 18,19,20 in another new column. This dataframe is huge so I dont know how many new columns I will have. Any help would be appreciated. Thanks


Solution

  • Here is an option with rleid to create a run-length-id grouping based on the sign of the column 'C' i.e. those adjacent elements with same sign will have same grouping 'id' and it gets incremented when there is a difference in sign. Then, we create the columns based on the count (n()) value to be particular number i.e. 3 or 4

    library(dplyr)
    library(data.table)
    df %>%
       mutate(C = B - lag(B)) %>%
      group_by(grp = rleid(sign(C))) %>%
      mutate(newC3 = if(n() ==3 && all(C < 0)) C else NA,
             newC4 = if(n() == 4 && all(C < 0) C else NA)
    

    To make this automated, an option is pivot_wider to reshape from 'long' to 'wide' format after creating the grouping id with rleid and replaceing the values that are not negative to NA. In this way, we get only the blocks of negative values to be in a separate column

    library(tidyr)
    library(stringr)
    df %>%
       mutate(C = B - lag(B)) %>%
       mutate(grp = str_c('C', rleid(sign(C))), 
         C1 = case_when(C >=0 ~ NA_real_, TRUE ~ C)) %>%
       pivot_wider(names_from = grp, values_from = C1)%>%
       select(where(~ sum(!is.na(.)) > 0))
    

    -output

    # A tibble: 20 x 6
    #       A     B      C     C2    C4    C7
    #   <int> <dbl>  <dbl>  <dbl> <dbl> <dbl>
    # 1     1   2   NA     NA        NA    NA
    # 2     2   1.8 -0.200 -0.200    NA    NA
    # 3     3   1.6 -0.200 -0.200    NA    NA
    # 4     4   1.8  0.200 NA        NA    NA
    # 5     5   4    2.2   NA        NA    NA
    # 6     6   6    2     NA        NA    NA
    # 7     7   8    2     NA        NA    NA
    # 8     8  10    2     NA        NA    NA
    # 9     9  12    2     NA        NA    NA
    #10    10  10   -2     NA        -2    NA
    #11    11   8   -2     NA        -2    NA
    #12    12   6   -2     NA        -2    NA
    #13    13  13    7     NA        NA    NA
    #14    14  14    1     NA        NA    NA
    #15    15  15    1     NA        NA    NA
    #16    16  16    1     NA        NA    NA
    #17    17  16    0     NA        NA    NA
    #18    18  15   -1     NA        NA    -1
    #19    19  14   -1     NA        NA    -1
    #20    20  13   -1     NA        NA    -1
    

    NOTE: The column names 'C2', 'C4', 'C7' are based on the ids created with rleid. If we wanted to rename, then it can be done with rename_with or rename_at

    ...
      %>%
       rename_at(vars(matches('^C\\d+')), ~ str_c('C', seq_along(.)))