Search code examples
rdataframereshapedata-cleaning

Attaching rows together if they share same id and creating new columns in R


Let's suppose we have the dataframe below:

df <- read.table(header=T, text=
'Patient_ID    Gene         Type
1           ATM             3
1           MEN1            1
2           BRCA1           3
2           RAD51C          2
2           BRCA2           2
3           CHEK2           1
4           MUTYH           1
4           BRCA2           3', stringsAsFactors=F)

How can I rearrange this dataframe to make it look like the following :

ID  ATM MEN1 BRCA1  RAD51C  CHEK2   MUTYH   BRCA2
1    3    1                 
2             3       2                       2
3                             1     
4                                      1      3

Please note that each row is now a unique case and the column Type was used to provide the values for the new created columns.


Solution

  • Your data is long/tidy. You want it to be wide. There are many functions to do this in R. A commonly used one is tidyr::pivot_wider(), which I demonstrate below:

    library(tidyverse)
    
    
    df <- read.table(header=T, text=
                       'Patient_ID    Gene         Type
    1           ATM             3
    1           MEN1            1
    2           BRCA1           3
    2           RAD51C          2
    2           BRCA2           2
    3           CHEK2           1
    4           MUTYH           1
    4           BRCA2           3', stringsAsFactors=F)
    
    # Blank cells will be NA
    df |> 
      rename(ID = Patient_ID) |> 
      pivot_wider(names_from = Gene,
                  values_from = Type) 
    #> # A tibble: 4 × 8
    #>      ID   ATM  MEN1 BRCA1 RAD51C BRCA2 CHEK2 MUTYH
    #>   <int> <int> <int> <int>  <int> <int> <int> <int>
    #> 1     1     3     1    NA     NA    NA    NA    NA
    #> 2     2    NA    NA     3      2     2    NA    NA
    #> 3     3    NA    NA    NA     NA    NA     1    NA
    #> 4     4    NA    NA    NA     NA     3    NA     1
    
    # Blank cells as empty strings ("")
    df |> 
      rename(ID = Patient_ID) |> 
      pivot_wider(names_from = Gene, 
                  values_from = Type, 
                  values_fn = as.character, 
                  values_fill = "")
    #> # A tibble: 4 × 8
    #>      ID ATM   MEN1  BRCA1 RAD51C BRCA2 CHEK2 MUTYH
    #>   <int> <chr> <chr> <chr> <chr>  <chr> <chr> <chr>
    #> 1     1 "3"   "1"   ""    ""     ""    ""    ""   
    #> 2     2 ""    ""    "3"   "2"    "2"   ""    ""   
    #> 3     3 ""    ""    ""    ""     ""    "1"   ""   
    #> 4     4 ""    ""    ""    ""     "3"   ""    "1"
    

    Created on 2022-05-23 by the reprex package (v2.0.1)

    EDIT: Second solution simplified in line with @DarrenTsai's comment