Search code examples
rsplitdplyrrowtidyverse

Separate comma separated values different lines


I have data like this.

structure(list(structureId = c("1JDN", "1DP4", "1XS5", "1SW1", 
"1P99", "1IXH"), structureTitle = c("Crystal Structure of Hormone Receptor", 
"DIMERIZED HORMONE BINDING DOMAIN OF THE ATRIAL NATRIURETIC PEPTIDE RECEPTOR", 
"The Crystal Structure of Lipoprotein Tp32 from Treponema pallidum", 
"Crystal structure of ProX from Archeoglobus fulgidus in complex with proline betaine", 
"1.7A crystal structure of protein PG110 from Staphylococcus aureus", 
"PHOSPHATE-BINDING PROTEIN (PBP) COMPLEXED WITH PHOSPHATE"), 
    chainId = c("A", "A", "A", "A", "A", "A"), ligandId = c("BMA,CL,FUC,MAN,NAG,NDG", 
    "CL,NAG,SO4", "MET", "MSE,PBE,ZN", "GLY,MET", "PO4"), ligandName = c("BETA-D-MANNOSE,CHLORIDE ION,ALPHA-L-FUCOSE,ALPHA-D-MANNOSE,N-ACETYL-D-GLUCOSAMINE,2-(ACETYLAMINO)-2-DEOXY-A-D-GLUCOPYRANOSE", 
    "CHLORIDE ION,N-ACETYL-D-GLUCOSAMINE,SULFATE ION", "METHIONINE", 
    "SELENOMETHIONINE,1,1-DIMETHYL-PROLINIUM,ZINC ION", "GLYCINE,METHIONINE", 
    "PHOSPHATE ION")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

I'd like to split the values of ligandId and ligandName in different rows. I mean, just 1 ligandId and ligandName per row.

I've tried using separate_rows but it didn't handle well with my two columns.

df %>% separate_rows(ligandId, ligandName, sep = ",")

But I'm getting this error:

> df %>% separate_rows(ligandId, ligandName, sep = ",")
Error: All nested columns must have the same number of elements.
Call `rlang::last_error()` to see a backtrace
> rlang::last_error()
<error>
message: All nested columns must have the same number of elements.
class:   `rlang_error`
backtrace:
  1. tidyr::separate_rows(., ligandId, ligandName, sep = ",")
 10. tidyr:::unnest.data.frame(data, !!!syms(vars), .drop = FALSE)
 12. tidyr::separate_rows(., ligandId, ligandName, sep = ",")
Call `rlang::last_trace()` to see the full backtrace

Also, I tried this: Split comma-separated strings in a column into separate rows but was unsuccessful.

In the end I'd like to have something like this:

1JDN   A   BMA   BETA-D-MANNOSE
1JDN   A   CL    CHLORIDE ION
1JDN   A   FUC   ALPHA-L-FUCOSE
1JDN   A   MAN   ALPHA-D-MANNOSE
1JDN   A   NAG   N-ACETYL-D-GLUCOSAMINE
1JDN   A   NDG   2-(ACETYLAMINO)-2-DEOXY-A-D-GLUCOPYRANOSE
...

Solution

  • We can use separate_rows

    library(tidyverse)
    df1 %>% 
        separate_rows(ligandId, sep=",")
    

    As the number of words for each row of 'ligandId', 'ligandName' are not the same, one option is to gather into 'long' format, then do the separate_rows on the 'val' column, and finally spread it back to 'wide'

    df1 %>% 
      gather(key, val, ligandId, ligandName) %>%
      separate_rows(val, sep=",") %>% 
      group_by(structureId, key) %>% 
      mutate(rn = row_number()) %>% 
      spread(key, val) %>%
      select(-rn)
    # A tibble: 17 x 5
    # Groups:   structureId [6]
    #   structureId structureTitle                                        chainId ligandId ligandName                
    #   <chr>       <chr>                                                 <chr>   <chr>    <chr>                     
    # 1 1DP4        DIMERIZED HORMONE BINDING DOMAIN OF THE ATRIAL NATRI… A       CL       CHLORIDE ION              
    # 2 1DP4        DIMERIZED HORMONE BINDING DOMAIN OF THE ATRIAL NATRI… A       NAG      N-ACETYL-D-GLUCOSAMINE    
    # 3 1DP4        DIMERIZED HORMONE BINDING DOMAIN OF THE ATRIAL NATRI… A       SO4      SULFATE ION               
    # 4 1IXH        PHOSPHATE-BINDING PROTEIN (PBP) COMPLEXED WITH PHOSP… A       PO4      PHOSPHATE ION             
    # 5 1JDN        Crystal Structure of Hormone Receptor                 A       BMA      BETA-D-MANNOSE            
    # 6 1JDN        Crystal Structure of Hormone Receptor                 A       CL       CHLORIDE ION              
    # 7 1JDN        Crystal Structure of Hormone Receptor                 A       FUC      ALPHA-L-FUCOSE            
    # 8 1JDN        Crystal Structure of Hormone Receptor                 A       MAN      ALPHA-D-MANNOSE           
    # 9 1JDN        Crystal Structure of Hormone Receptor                 A       NAG      N-ACETYL-D-GLUCOSAMINE    
    #10 1JDN        Crystal Structure of Hormone Receptor                 A       NDG      2-(ACETYLAMINO)-2-DEOXY-A…
    #11 1P99        1.7A crystal structure of protein PG110 from Staphyl… A       GLY      GLYCINE                   
    #12 1P99        1.7A crystal structure of protein PG110 from Staphyl… A       MET      METHIONINE                
    #13 1SW1        Crystal structure of ProX from Archeoglobus fulgidus… A       MSE      SELENOMETHIONINE          
    #14 1SW1        Crystal structure of ProX from Archeoglobus fulgidus… A       PBE      1                         
    #15 1SW1        Crystal structure of ProX from Archeoglobus fulgidus… A       ZN       1-DIMETHYL-PROLINIUM      
    #16 1SW1        Crystal structure of ProX from Archeoglobus fulgidus… A       <NA>     ZINC ION                  
    #17 1XS5        The Crystal Structure of Lipoprotein Tp32 from Trepo… A       MET      METHIONINE            
    

    For multiple columns with difference in the number of words, use cSplit

    library(splitstackshape)
    na.omit(cSplit(df1, c("ligandId", "ligandName"), sep=",", "long"))