r dplyr tidyr rbind mutate

Condensing wide dataframe with columnwise operations


I'm still learning R and was wondering if there is an elegant way of manipulating the df below to achieve df2.

I'm not sure whether a loop is the right tool for this, but basically for each V(X)_Type(X) column (excluding the Proxy columns), I would like to subtract the first row's value from each subsequent row's value, stopping when Proxy_Type(X) hits 99999.

This probably sounds confusing in words so hopefully an example from df to the desired df2 will assist in visualising what I'm trying to ask.

Type1 <- c('ABC','DEF','GHI','JKL','MNO','PQR')
V1_Type1 <- c('1','0.5','1','0.2','3','4')
V2_Type1 <- c('2','0.5','1','0.3','3.2','4.1')
V3_Type1 <- c('3','0.6','1','0.4','3.3','4.3')
Proxy_Type1 <- c('0','99999','99999','99999','99999','99999')

Type2 <- c('DEF','GHI','JKL','MNO','PQR','STU')
V1_Type2 <- c('0.5','1','0.2','3','4','4.2')
V2_Type2 <- c('0.5','1','0.3','3.2','4.1','2.2')
V3_Type2 <- c('0.6','1','0.4','3.3','4.3','3')
Proxy_Type2 <- c('0','1.35','99999','99999','99999','99999')

Type3 <- c('GHI','JKL','MNO','PQR','STU','VWX')
V1_Type3 <- c('1','0.2','3','4','4.2','4.1')
V2_Type3 <- c('1','0.3','3.2','4.1','2.2','1.8')
V3_Type3 <- c('1','0.4','3.3','4.3','3','4.2')
Proxy_Type3 <- c('0','2.5','3','99999','99999','99999')



df <- data.frame(Type1,V1_Type1,V2_Type1,V3_Type1,Proxy_Type1,
                 Type2,V1_Type2,V2_Type2,V3_Type2,Proxy_Type2,
                 Type3,V1_Type3,V2_Type3,V3_Type3,Proxy_Type3)


To <- c('DEF','GHI','GHI')
From <- c('GHI','JKL','MNO')
V1 <- c('0.5','-0.8','2')
V2 <- c('0.5','-0.7','2.2')
V3 <- c('0.4','-0.6','2.3')

df2 <- data.frame(To,From,V1,V2,V3)

So in the desired df2 dataframe, there is no row for "ABC" to "DEF", since "DEF" has Proxy_Type1 = 99999 and Type1 is skipped entirely. For Type2, "DEF" to "GHI" gives V1 = (1-0.5), V2 = (1-0.5), V3 = (1-0.6), stopping at GHI since JKL has Proxy_Type2 = 99999. Moving on to Type3, "GHI" to "JKL" gives V1 = (0.2-1), V2 = (0.3-1), V3 = (0.4-1), then "GHI" to "MNO" gives V1 = (3-1), V2 = (3.2-1), V3 = (3.3-1). The process stops there since PQR has Proxy_Type3 = 99999.
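To make the rule concrete, the Type3 block can be worked by hand in base R. This is only a sketch for a single block and a single V column; the vectors are copied from df above, and `keep`, `v1` and `diffs` are illustrative names, not part of the data.

```r
# Type3 columns copied from the question (stored as character, as in df)
Type3 <- c('GHI','JKL','MNO','PQR','STU','VWX')
V1_Type3 <- c('1','0.2','3','4','4.2','4.1')
Proxy_Type3 <- c('0','2.5','3','99999','99999','99999')

# Rows before the first 99999 in the Proxy column
keep <- seq_len(which(Proxy_Type3 == "99999")[1] - 1)

# Current row minus first row, then drop the first row itself
v1 <- as.numeric(V1_Type3[keep])
diffs <- (v1 - v1[1])[-1]

# Two rows: GHI/JKL with V1 = 0.2 - 1 and GHI/MNO with V1 = 3 - 1
data.frame(To = Type3[1], From = Type3[keep][-1], V1 = diffs)
```

The same steps would have to be repeated for V2, V3 and every other Type, which is the part I'd like to automate.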

I have hundreds of these "Type" column groups for which I would like to produce this calculation from df to df2. Doing it manually is much too time-consuming, so I would greatly appreciate your assistance.

Thank you


Solution

  • It took some time to understand the question. The solution is simple if you first tidy the data. In this case, that means first pivoting your df longer so that Type is a variable and not part of each variable name.

    Assuming the data is ordered by row number (idx in the example below), you simply do the following within each Type (using group_by()): keep the rows before the first Proxy == 99999; do your calculation (current row minus first row); set To from the first row and From from the current row; and remove the first row. Keeping only the columns To, From, V1...Vx then gives you df2:

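    library(dplyr)
    library(tidyr)
    
    df %>%
      mutate(idx = row_number()) %>%
      # Long format: one row per (idx, original column name)
      pivot_longer(-idx) %>%
      mutate(
        # \\d+ (not \\d) so that Type10, Type123, ... are parsed correctly
        Type = as.integer(sub(".*Type(\\d+)$", "\\1", name)),
        name = name %>%
          sub("_Type\\d+$", "", .) %>%
          sub("^Type\\d+$", "tofrom", .)
      ) %>%
      # Back to one row per (idx, Type) with columns tofrom, V1..Vx, Proxy
      pivot_wider() %>%
      arrange(Type, idx) %>%
      group_by(Type) %>%
      # Keep rows before the first Proxy == 99999 (all rows if there is none)
      filter(cumsum(Proxy == "99999") == 0) %>%
      mutate(
        # Current row minus first row of the group
        across(matches("^V\\d+$"), \(x) as.numeric(x) - as.numeric(x)[1]),
        To = tofrom[1],
        From = tofrom
      ) %>%
      filter(row_number() != 1) %>%
      ungroup() %>%
      select(To, From, matches("^V\\d+$"))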
    
    # A tibble: 3 × 5
      To    From     V1    V2    V3
      <chr> <chr> <dbl> <dbl> <dbl>
    1 DEF   GHI     0.5   0.5   0.4
    2 GHI   JKL    -0.8  -0.7  -0.6
    3 GHI   MNO     2     2.2   2.3
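
If you would rather avoid the reshape, the same per-Type steps can be sketched in base R. This is not the pipeline above but an alternative under the same assumptions (columns named TypeX, V1_TypeX, ..., Proxy_TypeX, and 99999 stored as the string "99999"). The helper name `condense_type()` and its `sentinel` argument are hypothetical, introduced only for this illustration.

```r
# Same df as in the question, rebuilt here so the sketch runs on its own
df <- data.frame(
  Type1 = c('ABC','DEF','GHI','JKL','MNO','PQR'),
  V1_Type1 = c('1','0.5','1','0.2','3','4'),
  V2_Type1 = c('2','0.5','1','0.3','3.2','4.1'),
  V3_Type1 = c('3','0.6','1','0.4','3.3','4.3'),
  Proxy_Type1 = c('0','99999','99999','99999','99999','99999'),
  Type2 = c('DEF','GHI','JKL','MNO','PQR','STU'),
  V1_Type2 = c('0.5','1','0.2','3','4','4.2'),
  V2_Type2 = c('0.5','1','0.3','3.2','4.1','2.2'),
  V3_Type2 = c('0.6','1','0.4','3.3','4.3','3'),
  Proxy_Type2 = c('0','1.35','99999','99999','99999','99999'),
  Type3 = c('GHI','JKL','MNO','PQR','STU','VWX'),
  V1_Type3 = c('1','0.2','3','4','4.2','4.1'),
  V2_Type3 = c('1','0.3','3.2','4.1','2.2','1.8'),
  V3_Type3 = c('1','0.4','3.3','4.3','3','4.2'),
  Proxy_Type3 = c('0','2.5','3','99999','99999','99999')
)

# Hypothetical helper: condense the block of columns for one Type index i
condense_type <- function(df, i, sentinel = "99999") {
  labels <- df[[paste0("Type", i)]]
  proxy  <- df[[paste0("Proxy_Type", i)]]
  # V columns of this Type, e.g. V1_Type3, V2_Type3, ...
  vcols <- grep(paste0("^V\\d+_Type", i, "$"), names(df), value = TRUE)

  # Rows before the first sentinel (all rows if the sentinel never appears)
  keep <- which(cumsum(proxy == sentinel) == 0)
  if (length(keep) < 2) return(NULL)  # no row left to compare with the first

  # Numeric matrix: one row per kept row, one column per V column
  vals <- vapply(df[keep, vcols, drop = FALSE], as.numeric,
                 numeric(length(keep)))
  # Subtract the first row from every row, then drop the first row
  diffs <- sweep(vals, 2, vals[1, ])[-1, , drop = FALSE]
  colnames(diffs) <- sub("_Type\\d+$", "", vcols)

  data.frame(To = labels[1], From = labels[keep][-1], diffs)
}

# Type indices present in df, in numeric order
type_ids <- sort(as.integer(sub("^Type", "",
                                grep("^Type\\d+$", names(df), value = TRUE))))

# Drop Types that yield no rows, then stack the rest
pieces <- Filter(Negate(is.null), lapply(type_ids, condense_type, df = df))
result <- do.call(rbind, pieces)
result
```

On the example data this should give the same three rows as the tidyverse version. The reshape-based pipeline is usually easier to extend, whereas this version trades that for having no package dependencies.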