Search code examples
rregexdataframedata-extractiongrepl

What is the correct way to extract multiple required patterns from a column in the data using regular expression (grepl function) in R


I have a dataframe (can also use the CSV file which I generated) that has 7 columns and close to a million rows.

I am specifically interested in fetching 5 different "ID's" from this entire string of data that is present in a single column and want to create new columns with the specific "ID" as column header name.The data looks something like this -

 UNIQUE-NAME    TYPES        COMMON-NAME                FORMULA                                  SOURCE                                                                                                                                                                  WEIGHT     SECONDARY-WEIGHT

 CPD-12676      Compounds    glycerophosphoglycerol     (C 19);(H 29);(N 5);(O 10);(S 2)        (SPIDER "3315400" NIL | devarsh | 3468 NIL);(PUBCHEM "54758713" NIL | ong);(SMALL "35022" | NIL | Tom);(CHEBI "28643" | NIL);(BAS "55672-34-1" | Harry | NILL)           29.83      27.65
 CPD-380        Enamines     UDP                         NA                                      NA                                                                                                                                                                      294.37     287.33
 NAD            Steroids     prephenate                 (O 4);(S 1)                             (PUBCHEM "87361293" NIL | ong);(CHEBI "87100" | NIL); (BAS "91234-28-2" | Harry | NILL)                                                                                  373.47     381.24
 ADP            Rings        pyridine                   (C 5);(H 5);(N 1)                       (SPIDER "2311345" NIL | devarsh | 6789 NIL);(SMALL "98234" | NIL | Tom)                                                                                                  39.33      40.65
 CAD-392        Molecules    pyrine                     (C 10);(H 9)                            (SPIDER "5454678" NIL | devarsh | 7688 NIL)                                                                                                                              392.543    400.656
 CAD-355        Groups       solution                   (C 12);(H 12)                            NA                                                                                                                                                                      32.53      40.66
 CPD-39234      Compounds    glycerophosphoglycerol     (C 19);(H 29);(N 5);(O 10);(S 2)        (SMALL "45465" | NIL | Harry);(CHEBI "33490" | NIL)                                                                                                                      92.43      100.65

Now I am interested in the "SOURCE" column and want to fetch

SPIDER ID, PUBCHEM ID, SMALL ID, CHEBI ID AND BAS ID

and create a new column in my csv file with each of these ID's as column headers and store the ID numbers as rows.

So, I want my final output table to be like this -

 UNIQUE-NAME    TYPES        COMMON-NAME                FORMULA                                  SMALL ID     PUBCHEM ID      SPIDER ID         CHEBI ID      BAS ID       WEIGHT     SECONDARY-WEIGHT

 CPD-12676      Compounds    glycerophosphoglycerol     (C 19);(H 29);(N 5);(O 10);(S 2)         35022        54758713        3315400           28643         55672-34-1   29.83      27.65
 CPD-380        Enamines     UDP                         NA                                      NA           NA              NA                NA            NA           294.37     287.33
 NAD            Steroids     prephenate                 (O 4);(S                                 NA           87361293        NA                87100         91234-28-2   373.47     381.24
 ADP            Rings        pyridine                   (C 5);(H 5);(N 1)                        98234        NA              2311345           NA            NA           39.33      40.65
 CAD-392        Molecules    pyrine                     (C 10);(H 9)                             NA           NA              5454678           NA            NA           392.543    400.656
 CAD-355        Groups       solution                   (C 12);(H 12)                            NA           NA              NA                NA            NA           32.53      40.66
 CPD-39234      Compounds    glycerophosphoglycerol     (C 19);(H 29);(N 5);(O 10);(S 2)         45465        NA              NA                33490         NA           92.43      100.65 

I was using grepl function for achieving this result -

dt3 <- grepl("^[:punct:]SMALL\\s[:punct:][0-9][:punct:]|^[:punct:]SPIDER\\s[:punct:][0-9][:punct:]|[:punct:]BAS\\s[:punct:][0-9]-+[:punct:]|[:punct:]CHEBI\\s[:punct:][0-9][:punct:]|[:punct:]PUBCHEM\\s[:punct:][0-9][:punct:]", dt2$SOURCE)

where dt2 is the dataframe having the data (displayed in the first table above). I am only taking the SOURCE column and trying to match the pattern for the columns that I want to fetch. grepl is not throwing any error but giving all FALSE values which clearly means I am doing something wrong with the regex syntax here for pattern matching and R is not understanding what I want to fetch.

I would really appreciate your help on this. Thanks!

Sample source data - Sample data


Solution

  • Here is a very crude method using dplyr and stringr.

    library(dplyr)
    library(stringr)
    
    df %>% # this assumes your data is in a data frame called 'df'
      mutate(small_id = str_extract(SOURCE, 'SMALL "[0-9]+"')) %>% 
      mutate(pubchem_id = str_extract(SOURCE, 'PUBCHEM "[0-9]+"')) %>% 
      mutate(spider_id = str_extract(SOURCE, 'SPIDER "[0-9]+"')) %>% 
      mutate(chebi_id = str_extract(SOURCE, 'CHEBI "[0-9]+"')) %>% 
      mutate(bas_id = str_extract(SOURCE, 'BAS "[0-9]+-[0-9]+-[0-9]+"')) %>% 
      mutate_at(vars(ends_with('_id')), ~gsub('[A-Z]|"|\\s+', '', .)) %>% 
      select(-SOURCE)
    
      `UNIQUE-NAME` TYPES     `COMMON-NAME`          FORMULA                          WEIGHT `SECONDARY-WEIGHT` small_id pubchem_id spider_id chebi_id bas_id    
      <chr>         <chr>     <chr>                  <chr>                             <dbl>              <dbl> <chr>    <chr>      <chr>     <chr>    <chr>     
    1 CPD-12676     Compounds glycerophosphoglycerol (C 19);(H 29);(N 5);(O 10);(S 2)   29.8               27.6 35022    54758713   3315400   28643    55672-34-1
    2 CPD-380       Enamines  UDP                    NA                                294.               287.  NA       NA         NA        NA       NA        
    3 NAD           Steroids  prephenate             (O 4);(S 1)                       373.               381.  NA       87361293   NA        87100    91234-28-2
    4 ADP           Rings     pyridine               (C 5);(H 5);(N 1)                  39.3               40.6 98234    NA         2311345   NA       NA        
    5 CAD-392       Molecules pyrine                 (C 10);(H 9)                      393.               401.  NA       NA         5454678   NA       NA        
    6 CAD-355       Groups    solution               (C 12);(H 12)                      32.5               40.7 NA       NA         NA        NA       NA        
    7 CPD-39234     Compounds glycerophosphoglycerol (C 19);(H 29);(N 5);(O 10);(S 2)   92.4              101.  45465    NA         NA        33490    NA