I intend to transpose the datatable DT_in
so that the columns indv1, ...
should be the rows and POS
as the columns. So for each indv
, there would be a named column from the POS
rows, which contains the corresponding values (1,2 and 0). I actually could manage this part through tidyr::pivot
function. The problem is, I need to do the whole job for each group of IDs in column ID
. So the output will be a list so that for each group of IDs in ID
as the first element of the list, the transposed table will be a sublist.
> DT_in
CHR POS ID indv1 indv2 indv3
1 31439_T_A ENSG00000232325 1 1 2
1 31504_G_A ENSG00000242474 2 1 1
1 34713_A_C ENSG00000242474 1 0 1
1 34918_C_T ENSG00000242474 1 0 1
1 35119_G_A ENSG00000239715 0 0 1
what I tried:
DT_in %>%
group_by(ID) %>%
pivot_longer(
cols = starts_with("indv")
)
View(DT_in)
CHR POS ID name value
<dbl> <chr> <chr> <chr> <dbl>
1 1 31439_T_A ENSG00000232325 indv1 1
2 1 31439_T_A ENSG00000232325 indv2 1
3 1 31439_T_A ENSG00000232325 indv3 2
4 1 31504_G_A ENSG00000242474 indv1 2
5 1 31504_G_A ENSG00000242474 indv2 1
6 1 31504_G_A ENSG00000242474 indv3 1
7 1 34713_A_C ENSG00000242474 indv1 1
8 1 34713_A_C ENSG00000242474 indv2 0
9 1 34713_A_C ENSG00000242474 indv3 1
10 1 34918_C_T ENSG00000242474 indv1 1
11 1 34918_C_T ENSG00000242474 indv2 0
12 1 34918_C_T ENSG00000242474 indv3 1
13 1 35119_G_A ENSG00000239715 indv1 0
14 1 35119_G_A ENSG00000239715 indv2 0
15 1 35119_G_A ENSG00000239715 indv3 1
pivot_wider(pivot, names_from = POS, values_from = value)
Error: Internal error in `compact_rep()`: Negative `n` in `compact_rep()`.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
In nrow * ncol : NAs produced by integer overflow
The error shown up when I wanted to do the job without building the list and also because the number of rows in the column POS
are not the same for each IDs in the ID
column. That's why I believe it should be done as a list.
UPDATE: Desired output:
> DF_OUT1
CHR ID name 31439_T_A
1 ENSG00000232325 indv1 1
1 ENSG00000232325 indv2 1
1 ENSG00000232325 indv3 2
> DF_OUT2
CHR ID name 31504_G_A 34713_A_C 34918_C_T
1 ENSG00000242474 indv1 2 1 1
1 ENSG00000242474 indv2 1 0 0
1 ENSG00000242474 indv3 1 1 1
> DF_OUT3
CHR ID name 35119_G_A
1 ENSG00000239715 indv1 0
1 ENSG00000239715 indv2 0
1 ENSG00000239715 indv3 1
As it is seen, I need the outputs to be stored for each IDs separetly either through a loop that gives separate dataframes or in a list of elements named by each unique IDs.
The split
function spits a data.frame into a list based on a function or variable. We can take your data.frame, pivot it into long form, split it into a list of dataframes based on ID, then pivot each of them wide again using lapply
to get a list of data.frames:
library(tidyverse)
df %>%
pivot_longer(starts_with('indv')) %>%
split(.$ID) %>%
lapply(., function(x) pivot_wider(x, names_from = 'POS'))
$ENSG00000232325
# A tibble: 3 × 4
CHR ID name `31439_T_A`
<int> <chr> <chr> <int>
1 1 ENSG00000232325 indv1 1
2 1 ENSG00000232325 indv2 1
3 1 ENSG00000232325 indv3 2
$ENSG00000239715
# A tibble: 3 × 4
CHR ID name `35119_G_A`
<int> <chr> <chr> <int>
1 1 ENSG00000239715 indv1 0
2 1 ENSG00000239715 indv2 0
3 1 ENSG00000239715 indv3 1
$ENSG00000242474
# A tibble: 3 × 6
CHR ID name `31504_G_A` `34713_A_C` `34918_C_T`
<int> <chr> <chr> <int> <int> <int>
1 1 ENSG00000242474 indv1 2 1 1
2 1 ENSG00000242474 indv2 1 0 0
3 1 ENSG00000242474 indv3 1 1 1
For a bit of extra context, here's an explanation for why we need to refer to the ID column as .$ID
in split when using the pipe: Why dot in base::split(.$cyl)?