Search code examples
rlistsplitdatatabletranspose

Customize DataTable structure to a List


I intend to transpose the datatable DT_in so that the columns indv1, ... should be the rows and POS as the columns. So for each indv, there would be a named column from the POS rows, which contains the corresponding values (1,2 and 0). I actually could manage this part through tidyr::pivot function. The problem is, I need to do the whole job for each group of IDs in column ID. So the output will be a list so that for each group of IDs in ID as the first element of the list, the transposed table will be a sublist.

> DT_in
CHR      POS          ID         indv1  indv2  indv3  
1   31439_T_A   ENSG00000232325     1    1    2
1   31504_G_A   ENSG00000242474     2    1    1
1   34713_A_C   ENSG00000242474     1    0    1
1   34918_C_T   ENSG00000242474     1    0    1
1   35119_G_A   ENSG00000239715     0    0    1

what I tried:

DT_in %>% 
  group_by(ID) %>% 
  pivot_longer(
  cols = starts_with("indv")
)

View(DT_in)
  CHR POS       ID              name  value
 <dbl> <chr>     <chr>           <chr> <dbl>
 1     1 31439_T_A ENSG00000232325 indv1     1
 2     1 31439_T_A ENSG00000232325 indv2     1
 3     1 31439_T_A ENSG00000232325 indv3     2
 4     1 31504_G_A ENSG00000242474 indv1     2
 5     1 31504_G_A ENSG00000242474 indv2     1
 6     1 31504_G_A ENSG00000242474 indv3     1
 7     1 34713_A_C ENSG00000242474 indv1     1
 8     1 34713_A_C ENSG00000242474 indv2     0
 9     1 34713_A_C ENSG00000242474 indv3     1
10     1 34918_C_T ENSG00000242474 indv1     1
11     1 34918_C_T ENSG00000242474 indv2     0
12     1 34918_C_T ENSG00000242474 indv3     1
13     1 35119_G_A ENSG00000239715 indv1     0
14     1 35119_G_A ENSG00000239715 indv2     0
15     1 35119_G_A ENSG00000239715 indv3     1

pivot_wider(pivot, names_from = POS, values_from = value)

Error: Internal error in `compact_rep()`: Negative `n` in `compact_rep()`.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
In nrow * ncol : NAs produced by integer overflow

The error shown up when I wanted to do the job without building the list and also because the number of rows in the column POS are not the same for each IDs in the ID column. That's why I believe it should be done as a list.

UPDATE: Desired output:

> DF_OUT1
CHR        ID       name    31439_T_A
 1 ENSG00000232325 indv1     1         
 1 ENSG00000232325 indv2     1 
 1 ENSG00000232325 indv3     2 

> DF_OUT2
CHR        ID       name    31504_G_A     34713_A_C    34918_C_T
 1 ENSG00000242474 indv1     2                 1           1
 1 ENSG00000242474 indv2     1                 0           0
 1 ENSG00000242474 indv3     1                 1           1

> DF_OUT3
CHR        ID       name    35119_G_A
 1 ENSG00000239715 indv1     0         
 1 ENSG00000239715 indv2     0 
 1 ENSG00000239715 indv3     1

As it is seen, I need the outputs to be stored for each IDs separetly either through a loop that gives separate dataframes or in a list of elements named by each unique IDs.


Solution

  • The split function spits a data.frame into a list based on a function or variable. We can take your data.frame, pivot it into long form, split it into a list of dataframes based on ID, then pivot each of them wide again using lapply to get a list of data.frames:

    library(tidyverse)
    df %>%
        pivot_longer(starts_with('indv')) %>%
        split(.$ID) %>%
        lapply(., function(x) pivot_wider(x, names_from = 'POS'))
    
    $ENSG00000232325
    # A tibble: 3 × 4
        CHR ID              name  `31439_T_A`
      <int> <chr>           <chr>       <int>
    1     1 ENSG00000232325 indv1           1
    2     1 ENSG00000232325 indv2           1
    3     1 ENSG00000232325 indv3           2
    
    $ENSG00000239715
    # A tibble: 3 × 4
        CHR ID              name  `35119_G_A`
      <int> <chr>           <chr>       <int>
    1     1 ENSG00000239715 indv1           0
    2     1 ENSG00000239715 indv2           0
    3     1 ENSG00000239715 indv3           1
    
    $ENSG00000242474
    # A tibble: 3 × 6
        CHR ID              name  `31504_G_A` `34713_A_C` `34918_C_T`
      <int> <chr>           <chr>       <int>       <int>       <int>
    1     1 ENSG00000242474 indv1           2           1           1
    2     1 ENSG00000242474 indv2           1           0           0
    3     1 ENSG00000242474 indv3           1           1           1
    
    

    For a bit of extra context, here's an explanation for why we need to refer to the ID column as .$ID in split when using the pipe: Why dot in base::split(.$cyl)?