Removing quasi-duplicates from an R Dataframe


I have a data frame with two columns. Column 1 is an identification number and column 2 is a compound name. The compounds in column 2, however, are often repetitive (different forms of the same compound). I would like to remove every duplicate except the simplest form of each compound.

This is the Dataframe:

>NISTSpecR

     NIST                                                     NAME
   366620                              Formic acid, TMS derivative
   366765 2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative
   342340                              Acetic acid, TMS derivative
   352374                           Propanoic acid, TMS derivative
   333858                             Butyric Acid, TMS derivative
   352377                           Pentanoic acid, TMS derivative
    24239                            Hexanoic acid, TMS derivative
   333733                           Heptanoic acid, TMS derivative
   352455                             Oxalic acid, 2TMS derivative
   414056                   Succinic acid, monoethyl ester-, (TMS)
   332809                              Adipic acid, TMS derivative
    30799                            Pimelic acid, 2TMS derivative
   292699                            Suberic acid, 2TMS derivative
   333874                             Citric acid, 4TMS derivative
   366657                             Citric acid, 3TMS derivative
   333513                         (-)-Epinephrine, 3TMS derivative
    16985                  Epinephrine, (.beta.)-, 3TMS derivative
    24795                    Norepinephrine, (R)-, 5TMS derivative
   332935                       DL-Norepinephrine, 4TMS derivative

And here is its structure:

> str(NISTSpecR)

'data.frame':   154 obs. of  3 variables:
 $ Spec: Factor w/ 239429 levels "1 0; 13 2; 14 27; 15 239; 16 3; 18 2; 26 3; 27 36; 28 32; 29 113; 30 9; 31 64; 32 9; 33 17; 34 17; 35 20; 36 1; 37 1; 41 8; 42 "| __truncated__,..: 23720 32791 3011 32175 12349 29069 193166 26108 28713 73845 ...
 $ NIST: chr  "366620" "366765" "342340" "352374" ...
 $ NAME: Factor w/ 239430 levels "-4'-Dimethylamino-2'-(trimethylsilyl)acetanilide",..: 157152 39442 108436 210392 133148 199151 169386 168243 195800 229235 ...

I would like the end result to look something like this:

>NISTSpecR

     NIST                                                     NAME
   366620                              Formic acid, TMS derivative
   342340                              Acetic acid, TMS derivative
   352374                           Propanoic acid, TMS derivative
   333858                             Butyric Acid, TMS derivative
   352377                           Pentanoic acid, TMS derivative
    24239                            Hexanoic acid, TMS derivative
   333733                           Heptanoic acid, TMS derivative
   352455                             Oxalic acid, 2TMS derivative
   414056                   Succinic acid, monoethyl ester-, (TMS)
   332809                              Adipic acid, TMS derivative
    30799                            Pimelic acid, 2TMS derivative
   292699                            Suberic acid, 2TMS derivative
   366657                             Citric acid, 3TMS derivative
   333513                         (-)-Epinephrine, 3TMS derivative
    24795                    Norepinephrine, (R)-, 5TMS derivative

There should be only one of each parent compound (e.g. formic acid, ...), and it needs to be the simplest version (the one with the fewest characters).

> dput(as.character(NISTSpecR$NAME))

c("Formic acid, TMS derivative", "2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative", 
"Acetic acid, TMS derivative", "Propanoic acid, TMS derivative", 
"Butyric Acid, TMS derivative", "Pentanoic acid, TMS derivative", 
"Hexanoic acid, TMS derivative", "Heptanoic acid, TMS derivative", 
"Oxalic acid, 2TMS derivative", "Succinic acid, monoethyl ester-, (TMS)", 
"Adipic acid, TMS derivative", "Pimelic acid, 2TMS derivative", 
"Suberic acid, 2TMS derivative", "Citric acid, 4TMS derivative", 
"Citric acid, 3TMS derivative", "Citric acid 3TMS", "Citric acid, ethyl ester, tri-TMS", 
"Isocitric acid lactone, 2TMS derivative", "Glyoxylic acid, di-TMS", 
"Pyruvic acid, TMS derivative", "Malic acid, 2TMS derivative", 
"Malic acid 1-ethyl ester, 2TMS", "Malic acid, 4-ethyl ester, 2TMS", 
"Malic acid, 3TMS derivative", "4-Hydroxybutanoic acid, 2TMS derivative", 
"Prostaglandin A1, 2TMS derivative", "Prostaglandin A2, 2TMS derivative", 
"Prostaglandin E2, 3TMS", "D-Arabinose, 4TMS derivative", "D-Xylose, 4TMS derivative", 
"D-Lyxose, 4TMS derivative", "D-Ribose, 4TMS derivative", "D-Glucose, 5TMS derivative", 
"D-Galactose, 5TMS derivative", "D-Mannose, 5TMS derivative", 
"D-Allose, oxime (isomer 1), 6TMS derivative", "D-Allose, oxime (isomer 2), 6TMS derivative", 
"D-Altrose, 5TMS derivative", "Dihydroxyacetone, 2TMS derivative", 
"1,3-Dihydroxyacetone dimer, 4TMS derivative", "D-Fructose, 5TMS derivative", 
"D-Psicose, 5TMS derivative", "Sedoheptulose, 6TMS derivative", "D-2-Deoxyribose, 3TMS derivative", "2-Deoxyribose, 3TMS derivative", "L-Fucose, 4TMS derivative", "L-Rhamnose, (R,R,S,S)-, 4TMS derivative", "L-Rhamnose, 4TMS derivative", "N-Acetyl-D-glucosamine, 4TMS derivative", "D-Gluconic acid, 6TMS derivative", "Glycerol monostearate, 2TMS derivative", "Glycerol 2-laurate, 2TMS derivative", "Glycerol, 3TMS derivative", "Xylitol, 5TMS derivative", "D-Sorbitol, 6TMS derivative", "D-Mannitol, 6TMS derivative", "Sucrose, 8TMS derivative", "D-Lactose, (isomer 1), 8TMS derivative", ".beta.-D-Lactose, (isomer 1), 8TMS derivative", "D-Lactose, (isomer 2), 8TMS derivative", ".beta.-D-Lactose, (isomer 2), 8TMS derivative", ".alpha.-D-Lactose, 8TMS derivative", ".alpha.-D-Lactose, 8TMS derivative", ".beta.-Lactose, 8TMS derivative", "Lactose, 8TMS derivative", "Maltose, 8TMS derivative , isomer 1", "Maltose, 8TMS derivative , isomer 2", "Maltose, 8TMS derivative", "D-Trehalose, 7TMS derivative", "Melibiose, 8TMS derivative", "L-Ornithine, 3TMS derivative", "DL-Ornithine, 3TMS derivative", "DL-Ornithine, 4TMS derivative", "L-Ornithine, 4TMS derivative", "L-Homoserine, 2TMS derivative", "L-Citrulline, 3TMS derivative", "3-Iodo-L-tyrosine, 3TMS derivative", "3-Aminoisobutyric acid, TMS derivative", "3-Aminoisobutyric acid, 3TMS derivative", "3-Aminoisobutyric acid, 2TMS derivative", "D-Isoleucine, N-acetyl-, TMS derivative", "L-Hydroxyproline, (E)-, 2TMS derivative", "L-Hydroxyproline, (E)-, 3TMS derivative", "Hydroxyproline, 3TMS derivative", "3-Hydroxyproline, 3TMS derivative", "L-Cystine, 4TMS derivative", "Ethanolamine, 3TMS derivative", "Ethanolamine, 2TMS derivative", "3-Aminopropanol, TMS derivative", "Putrescine, 4TMS derivative", "Histamine, 2TMS derivative", "Histamine, 3TMS derivative", "Dopamine, 4TMS derivative", "Dopamine, 3TMS derivative", "Serotonin, 4TMS derivative", "Tyramine, 3TMS derivative", "Tyramine, TMS derivative", "Tyramine, 2TMS derivative", 
"Phenethylamine, 2TMS derivative", "1-Phenethylamine, TMS derivative", "Phenethylamine, TMS derivative", "Biotin, 3TMS derivative", "16.beta.,17.alpha.-Estriol, 3TMS derivative", "Estriol, 3TMS derivative", "16.alpha.,17.alpha.-Estriol, 3TMS derivative", "16.beta.,17.beta.-Estriol, 3TMS derivative", "Estrone, TMS derivative", "16-Estrone, TMS derivative", "Estrone, O-methyloxime, TMS derivative", "Equilin, TMS derivative", "Equilenin, (14.beta.)-, TMS derivative", "Equilenin, TMS derivative", "2-Hydroxyestradiol, 3TMS derivative", "Androsterone, (E)-, TMS derivative", "Dehydroepiandrosterone, (E)-, TMS derivative", "5.beta.-Dihydrotestosterone, TMS derivative", "5.alpha.-Dihydrotestosterone, TMS derivative", "Testosterone O-methyloxime, TMS derivative", "Testosterone, TMS derivative", "Pregnenolone, TMS derivative", "Aldosterone, 2TMS derivative", "Aldosterone, N-methoxy-tri-TMS", "Corticosterone, bis(O-methyloxime)", "Deoxycholic Acid, 2TMS derivative", "Deoxycholic Acid, 3TMS derivative", "Lithocholic acid, 2TMS derivative", "Cholesterol, TMS derivative", "Desmosterol, TMS derivative", "Ergosterol, TMS derivative", "Campesterol, TMS derivative", "Fucosterol, TMS derivative", "Stigmastanol, TMS derivative", "Stigmasterol, TMS derivative", "11-Deoxycortisol, bis(O-methyloxime)", "Melatonin, 2TMS derivative", "Adrenaline, 4TMS derivative", "L-Adrenaline, 4TMS derivative", "Glycine, 3TMS derivative", "Glycine, TMS derivative", "Glycine, 2TMS derivative", "Aspartic acid, 3TMS derivative", "L-Aspartic acid, 3TMS derivative", "L-Aspartic acid, 2TMS derivative", "L-Glutamic acid, 3TMS derivative", "(-)-Epinephrine, 3TMS derivative", "Epinephrine, (.beta.)-, 3TMS derivative", "(-)-Epinephrine, 4TMS derivative", "Norepinephrine, (R)-, 5TMS derivative", "DL-Norepinephrine, 4TMS derivative", "Norepinephrine, (R)-, 4TMS derivative", "Cycloserine, 3TMS derivative", "Cycloheximide, 2TMS derivative", "Chloramphenicol, 2TMS derivative", "Chloramphenicol, 3TMS derivative" )

Thank You.


Solution

  • Following your edits, here is what I did. First, split each name at the commas and, from each piece, extract the first word that ends in one of the typical compound suffixes:

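    library(stringr)  # str_split(), str_extract(); also re-exports %>%
    nist <- as.character(NISTSpecR$NAME)  # assumed: the name column from the question
    parents <- str_split(nist, ",") %>% 
      lapply(str_extract, "[A-Za-z][a-z]+(ine|ol|in|ose|ic|one|ide)")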
    

    Then, since some names contain more than one comma, the parent word is not always in the first piece. For each list element, record which piece holds a non-NA match in the list extract_indices, and collect those positions in the vector indvec:

    extract_indices <- parents %>% 
      lapply(function(x) which(!is.na(x)))
    indvec <- do.call("c",extract_indices)
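
    To make the two steps above concrete, here is a minimal sketch on two names copied from the question. The object names demo, pieces, hits, pos and demo_parents are made up for this illustration, and the per-element lookup with vapply/mapply is an alternative way of writing the same idea, not the code used above.

    ```r
    library(stringr)

    # Two names copied from the question's data.
    demo <- c("Formic acid, TMS derivative",
              "16.beta.,17.alpha.-Estriol, 3TMS derivative")

    # Split on commas, then look for a suffix-bearing word in every piece.
    pieces <- str_split(demo, ",")
    hits <- lapply(pieces, str_extract,
                   "[A-Za-z][a-z]+(ine|ol|in|ose|ic|one|ide)")

    # Position of the first piece that holds a match, one per compound.
    pos <- vapply(hits, function(x) which(!is.na(x))[1], integer(1))

    # Pick the matched word out of each list element.
    demo_parents <- mapply(function(x, i) x[i], hits, pos)
    demo_parents
    ```

    For the first name the match sits in piece 1, for the second in piece 2, which is why the position has to be recorded per name instead of always taking the first piece.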
    

    Then loop through parents and, for each list element, pull out the piece in which the parent compound was found:

    # indvec[i] lines up with parents[[i]] only if every name has exactly one match
    answer <- sapply(seq_along(parents),
           function(i){
             parents[[i]][indvec[i]]
           })
    
       answer
    
      [1] "Formic"                 "Acetic"                 "Acetic"                 "Propanoic"              "Butyric"               
      [6] "Pentanoic"              "Hexanoic"               "Heptanoic"              "Oxalic"                 "Succinic"              
     [11] "Adipic"                 "Pimelic"                "Suberic"                "Citric"                 "Citric"                
     [16] "Citric"                 "Citric"                 "Isocitric"              "Glyoxylic"              "Pyruvic"               
     [21] "Malic"                  "Malic"                  "Malic"                  "Malic"                  "Hydroxybutanoic"       
     [26] "Prostaglandin"          "Prostaglandin"          "Prostaglandin"          "Arabinose"              "Xylose"                
     [31] "Lyxose"                 "Ribose"                 "Glucose"                "Galactose"              "Mannose"               
     [36] "Allose"                 "Allose"                 "Altrose"                "Dihydroxyacetone"       "Dihydroxyacetone"      
     [41] "Fructose"               "Psicose"                "Sedoheptulose"          "Deoxyribose"            "Deoxyribose"           
     [46] "Fucose"                 "Rhamnose"               "Rhamnose"               "glucosamine"            "Gluconic"              
     [51] "Glycerol"               "Glycerol"               "Glycerol"               "Xylitol"                "Sorbitol"              
    

    It continues like this...

    Now, since you only want the shortest name for each parent (the one with the fewest characters), first count the characters of every name in the original data. Then, for each extracted parent, find all names that share it and select the shortest of them.

    nchar_parent <- nchar(nist)  # length of every original name
    final <- vector(mode = "character", length(nist))
    for(i in seq_along(nist)){
      # positions of all names that share the parent of name i
      temp_matches <- which(answer %in% answer[i])
      shortest <- temp_matches[which.min(nchar_parent[temp_matches])]
      final[i] <- nist[shortest]
    }
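
    If you prefer to avoid the explicit loop, the same grouping can probably be written with base R's ave(). This is only a sketch on toy data: nist_demo and answer_demo are hypothetical stand-ins for nist and answer, with names taken from the question.

    ```r
    # Toy stand-ins: four names and the parent word extracted from each.
    nist_demo <- c("Citric acid, 4TMS derivative",
                   "Citric acid 3TMS",
                   "Malic acid, 2TMS derivative",
                   "Malic acid, 3TMS derivative")
    answer_demo <- c("Citric", "Citric", "Malic", "Malic")

    # Within each parent group, replace every name by the shortest one.
    # which.min() returns the first minimum, so ties go to the earlier name.
    final_demo <- ave(nist_demo, answer_demo,
                      FUN = function(x) x[which.min(nchar(x))])
    final_demo
    ```

    The result has the same length as the input, with every name replaced by the shortest name of its group, just like final in the loop.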
    

    Your final answer looks like this:

      [1] "Formic acid, TMS derivative"                  "Acetic acid, TMS derivative"                 
      [3] "Acetic acid, TMS derivative"                  "Propanoic acid, TMS derivative"              
      [5] "Butyric Acid, TMS derivative"                 "Pentanoic acid, TMS derivative"              
      [7] "Hexanoic acid, TMS derivative"                "Heptanoic acid, TMS derivative"              
      [9] "Oxalic acid, 2TMS derivative"                 "Succinic acid, monoethyl ester-, (TMS)"      
     [11] "Adipic acid, TMS derivative"                  "Pimelic acid, 2TMS derivative"               
     [13] "Suberic acid, 2TMS derivative"                "Citric acid 3TMS"                            
     [15] "Citric acid 3TMS"                             "Citric acid 3TMS"                            
     [17] "Citric acid 3TMS"                             "Isocitric acid lactone, 2TMS derivative"     
     [19] "Glyoxylic acid, di-TMS"                       "Pyruvic acid, TMS derivative"                
     [21] "Malic acid, 2TMS derivative"                  "Malic acid, 2TMS derivative"                 
     [23] "Malic acid, 2TMS derivative"                  "Malic acid, 2TMS derivative"
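
    The vector above still has one entry per original row. To get back to a data frame with a single row per parent compound, as asked in the question, one option is to keep only the row holding the shortest name in each group. The sketch below uses a small hypothetical data frame df and a hand-written parent vector in place of NISTSpecR and answer; the NIST numbers and names are copied from the question.

    ```r
    # Toy stand-in for the question's data frame.
    df <- data.frame(
      NIST = c("366765", "342340", "333874", "366657"),
      NAME = c("2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative",
               "Acetic acid, TMS derivative",
               "Citric acid, 4TMS derivative",
               "Citric acid, 3TMS derivative"),
      stringsAsFactors = FALSE
    )
    # Parent word per row, as the extraction step would produce it.
    parent <- c("Acetic", "Acetic", "Citric", "Citric")

    # For each parent, the row index of its shortest name (first one wins ties).
    keep <- tapply(seq_len(nrow(df)), parent,
                   function(i) i[which.min(nchar(df$NAME[i]))])

    # Subset in the original row order.
    result <- df[sort(as.vector(keep)), ]
    result
    ```

    Note that the two citric acid names have the same length, so this rule keeps whichever comes first; if you want a different tie-break (for example the lower TMS count), you would need an extra ordering step.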