Search code examples
rdataframedplyrtidyrreshape2

How to turn column data into column names?


I was given a spectroscopy reference database that is very different from the sample dataset I'm using. In my data, 0 or 1 indicate if there's a peak or not, while in the reference database the peak position is listed as a row value and grouped as a column in terms of peptide data (which I don't need).

My dataset looks like this:

Sample   1110   1111   1112
1         1        0      0
2         1        0      1
3         0        1      1
4         1        1      1

While the reference database obnoxiously looks like this (note the multiple values per column):

Species     peptide1   peptide2    peptide3
cow           1110        1112         NA
sheep         1111        1112         NA
goat           NA         1113        1114

The desired output would look similar to my dataset:

Species   1110    1111    1112    1113    1114
cow         1        0      1        0      0
sheep       0        1      1        0      0
goat        0        0      0        1      1

These are oversimplified but they illustrate the problem - let me know if I need to provide actual data. I need to both transpose/sort the values in the cells, while replacing the original values in a binary (again, I don't need to keep the peptide column names in the reference database). I'm really hoping there's an easy dplyr or tidyr trick for this - I imagine a spread function could work, but I wouldn't know how to do it for multiple columns nor how to leave the original data. Alternatively, I could manually append all of the data as a long format, then melt/cast it into a wider format?


Solution

  • I hope this is what you are looking for:

    library(dplyr)
    library(tidyr)
    
    df %>%
      pivot_longer(!Species) %>%
      mutate(val = 1) %>%
      select(-name) %>%
      drop_na() %>%
      arrange(value) %>%
      pivot_wider(names_from = value, values_from = val) %>%
      mutate(across(!Species, ~ replace_na(., 0)))
    
    # A tibble: 3 x 6
      Species `1110` `1111` `1112` `1113` `1114`
      <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
    1 cow          1      0      1      0      0
    2 sheep        0      1      1      0      0
    3 goat         0      0      0      1      1
    
    

    Data

    df <- tribble(
      ~Species,     ~peptide1,   ~peptide2,    ~peptide3,
      "cow",           1110,        1112,         NA,
      "sheep",         1111,        1112,         NA,
      "goat",           NA,         1113,        1114
    )
    

    Simplifying the above syntax by using the really useful arguments of pivot_longer() & pivot_wider() as

    df %>% 
      pivot_longer(!Species, values_drop_na = TRUE) %>% 
      pivot_wider(id_cols = Species, names_from = value, names_sort = TRUE, values_fill = 0, values_fn = length)
    
    # A tibble: 3 x 6
      Species `1110` `1111` `1112` `1113` `1114`
      <chr>    <int>  <int>  <int>  <int>  <int>
    1 cow          1      0      1      0      0
    2 sheep        0      1      1      0      0
    3 goat         0      0      0      1      1