I have a dataframe (can also use the CSV file which I generated) that has 7 columns and close to a million rows.
I am specifically interested in fetching 5 different "ID's" from this entire string of data that is present in a single column and want to create new columns with the specific "ID" as column header name.The data looks something like this -
UNIQUE-NAME TYPES COMMON-NAME FORMULA SOURCE WEIGHT SECONDARY-WEIGHT
CPD-12676 Compounds glycerophosphoglycerol (C 19);(H 29);(N 5);(O 10);(S 2) (SPIDER "3315400" NIL | devarsh | 3468 NIL);(PUBCHEM "54758713" NIL | ong);(SMALL "35022" | NIL | Tom);(CHEBI "28643" | NIL);(BAS "55672-34-1" | Harry | NILL) 29.83 27.65
CPD-380 Enamines UDP NA NA 294.37 287.33
NAD Steroids prephenate (O 4);(S 1) (PUBCHEM "87361293" NIL | ong);(CHEBI "87100" | NIL); (BAS "91234-28-2" | Harry | NILL) 373.47 381.24
ADP Rings pyridine (C 5);(H 5);(N 1) (SPIDER "2311345" NIL | devarsh | 6789 NIL);(SMALL "98234" | NIL | Tom) 39.33 40.65
CAD-392 Molecules pyrine (C 10);(H 9) (SPIDER "5454678" NIL | devarsh | 7688 NIL) 392.543 400.656
CAD-355 Groups solution (C 12);(H 12) NA 32.53 40.66
CPD-39234 Compounds glycerophosphoglycerol (C 19);(H 29);(N 5);(O 10);(S 2) (SMALL "45465" | NIL | Harry);(CHEBI "33490" | NIL) 92.43 100.65
Now I am interested in the "SOURCE" column and want to fetch
SPIDER ID, PUBCHEM ID, SMALL ID, CHEBI ID AND BAS ID
and create a new column in my csv file with each of these ID's as column headers and store the ID numbers as rows.
So, I want my final output table to be like this -
UNIQUE-NAME TYPES COMMON-NAME FORMULA SMALL ID PUBCHEM ID SPIDER ID CHEBI ID BAS ID WEIGHT SECONDARY-WEIGHT
CPD-12676 Compounds glycerophosphoglycerol (C 19);(H 29);(N 5);(O 10);(S 2) 35022 54758713 3315400 28643 55672-34-1 29.83 27.65
CPD-380 Enamines UDP NA NA NA NA NA NA 294.37 287.33
NAD Steroids prephenate (O 4);(S NA 87361293 NA 87100 91234-28-2 373.47 381.24
ADP Rings pyridine (C 5);(H 5);(N 1) 98234 NA 2311345 NA NA 39.33 40.65
CAD-392 Molecules pyrine (C 10);(H 9) NA NA 5454678 NA NA 392.543 400.656
CAD-355 Groups solution (C 12);(H 12) NA NA NA NA NA 32.53 40.66
CPD-39234 Compounds glycerophosphoglycerol (C 19);(H 29);(N 5);(O 10);(S 2) 45465 NA NA 33490 NA 92.43 100.65
I was using grepl
function for achieving this result -
dt3 <- grepl("^[:punct:]SMALL\\s[:punct:][0-9][:punct:]|^[:punct:]SPIDER\\s[:punct:][0-9][:punct:]|[:punct:]BAS\\s[:punct:][0-9]-+[:punct:]|[:punct:]CHEBI\\s[:punct:][0-9][:punct:]|[:punct:]PUBCHEM\\s[:punct:][0-9][:punct:]", dt2$SOURCE)
where dt2
is the dataframe having the data (displayed in the first table above). I am only taking the SOURCE
column and trying to match the pattern for the columns that I want to fetch. grepl
is not throwing any error but giving all FALSE
values which clearly means I am doing something wrong with the regex syntax here for pattern matching and R
is not understanding what I want to fetch.
I would really appreciate your help on this. Thanks!
Sample source data - Sample data
Here is a very crude method using dplyr
and stringr
.
library(dplyr)
library(stringr)
df %>% # this assumes your data is in a data frame called 'df'
mutate(small_id = str_extract(SOURCE, 'SMALL "[0-9]+"')) %>%
mutate(pubchem_id = str_extract(SOURCE, 'PUBCHEM "[0-9]+"')) %>%
mutate(spider_id = str_extract(SOURCE, 'SPIDER "[0-9]+"')) %>%
mutate(chebi_id = str_extract(SOURCE, 'CHEBI "[0-9]+"')) %>%
mutate(bas_id = str_extract(SOURCE, 'BAS "[0-9]+-[0-9]+-[0-9]+"')) %>%
mutate_at(vars(ends_with('_id')), ~gsub('[A-Z]|"|\\s+', '', .)) %>%
select(-SOURCE)
`UNIQUE-NAME` TYPES `COMMON-NAME` FORMULA WEIGHT `SECONDARY-WEIGHT` small_id pubchem_id spider_id chebi_id bas_id
<chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 CPD-12676 Compounds glycerophosphoglycerol (C 19);(H 29);(N 5);(O 10);(S 2) 29.8 27.6 35022 54758713 3315400 28643 55672-34-1
2 CPD-380 Enamines UDP NA 294. 287. NA NA NA NA NA
3 NAD Steroids prephenate (O 4);(S 1) 373. 381. NA 87361293 NA 87100 91234-28-2
4 ADP Rings pyridine (C 5);(H 5);(N 1) 39.3 40.6 98234 NA 2311345 NA NA
5 CAD-392 Molecules pyrine (C 10);(H 9) 393. 401. NA NA 5454678 NA NA
6 CAD-355 Groups solution (C 12);(H 12) 32.5 40.7 NA NA NA NA NA
7 CPD-39234 Compounds glycerophosphoglycerol (C 19);(H 29);(N 5);(O 10);(S 2) 92.4 101. 45465 NA NA 33490 NA