Search code examples
rwide-format-data

How to transform long to wide data, while adding extra rows with values from further columns


I have been looking for a solution to my problem but have not been successful. I assume because it is an uncommon wish in terms of data analysis, good practice, and how to properly store data. I need the data in the specified format for further analysis in Primer. With small datasets, the second step (adding the factors) is easy to do by hand but with bigger ones it becomes more than tedious. Therefore, I would like to automate it.

The overall goal is to take long data and transform it to wide data, while additionally adding rows at the bottom which represent factors and are based on columns in the long data.

I have a data frame that looks like the following:

data <- data.frame(nr = c(1, 2, 3, 4, 5, 6, 7, 8), 
                 year = c(2013, 2013, 2013, 2013, 2022, 2022, 2022, 2022), 
                 depth = c(35, 35, 50, 50, 35, 35, 50, 50), 
                 species = c("A", "B", "A", "D", "C", "B", "D", "A"), 
                 area = c(1.0, 0.5, 3.2, 4.3, 2.0, 5.6, 1.8, 2.3))

The output that I am trying to achieve should look like this:

enter image description here

The first row represents names that are a combination of the factors used (year, depth) and the replicate number (nr). The following rows are the wide format of the species with their respective values (area). At the bottom are rows with the factors (year and depth), as well as the interaction between the two (year x depth).

With the following code, I transformed the data to wide format and included the correct names per column. It is missing the factors at the bottom though.

primer <- data %>%
  pivot_wider(names_from = c(year, depth, nr), values_from = area) %>% 
  mutate(across(.cols = everything(), ~replace_na(.x, 0))) %>% 
  as.data.frame()

Since the factors are also included in the names, I was thinking to extract them and save them as a vector. Afterwards you could extract only the parts that you need (one time the year, and one time the depth) and row bind it to the "primer" data frame. While this might work (not sure how), it does not seem like the best option since the function would have to be adjusted each time to accommodate for different factors and also different numbers of factors.

I am looking for a more universal function to solve my problem.

Thank you for your help!


Solution

  • The trick is to mutate "further" columns, pivot_wider area and than transpose the dataframe.

    Caution: all column types are character

    library(dplyr)
    library(tidyr)
    
    data %>% 
      mutate(col_name=paste(year,depth,nr,sep="_"),
             year_x_depth=paste(year,depth,sep="_")) %>%
      pivot_wider(names_from=species,
                  values_from=area,
                  values_fill=0) %>% 
      tibble::column_to_rownames(var="col_name") %>% 
      t %>%
      as.data.frame
    
                 2013_35_1 2013_35_2 2013_50_3 2013_50_4 2022_35_5 2022_35_6
    nr                   1         2         3         4         5         6
    year              2013      2013      2013      2013      2022      2022
    depth               35        35        50        50        35        35
    year_x_depth   2013_35   2013_35   2013_50   2013_50   2022_35   2022_35
    A                  1.0       0.0       3.2       0.0       0.0       0.0
    B                  0.0       0.5       0.0       0.0       0.0       5.6
    D                  0.0       0.0       0.0       4.3       0.0       0.0
    C                    0         0         0         0         2         0
                 2022_50_7 2022_50_8
    nr                   7         8
    year              2022      2022
    depth               50        50
    year_x_depth   2022_50   2022_50
    A                  0.0       2.3
    B                  0.0       0.0
    D                  1.8       0.0
    C                    0         0
    

    Updates (reorder column):

    data %>% 
      mutate(col_name=paste(year,depth,nr,sep="_"),
             year_x_depth=paste(year,depth,sep="_")) %>%
      pivot_wider(names_from=species,
                  values_from=area,
                  values_fill=0) %>% 
      relocate(any_of(c("A","B","D","C")))%>%
      tibble::column_to_rownames(var="col_name") %>% 
      t %>%
      as.data.frame
    
                 2013_35_1 2013_35_2 2013_50_3 2013_50_4 2022_35_5 2022_35_6 2022_50_7 2022_50_8
    A                  1.0       0.0       3.2       0.0       0.0       0.0       0.0       2.3
    B                  0.0       0.5       0.0       0.0       0.0       5.6       0.0       0.0
    D                  0.0       0.0       0.0       4.3       0.0       0.0       1.8       0.0
    C                    0         0         0         0         2         0         0         0
    nr                   1         2         3         4         5         6         7         8
    year              2013      2013      2013      2013      2022      2022      2022      2022
    depth               35        35        50        50        35        35        50        50
    year_x_depth   2013_35   2013_35   2013_50   2013_50   2022_35   2022_35   2022_50   2022_50