Search code examples
excelsplitpowerquerytext-extraction

Extract CAS Number from Downloaded Data


I have downloaded a CSV file from Pubchem containing over 5000+ records. One of the columns contains a bunch of computed synonyms where CAS Number is the records I wish to extract. Unfortunately, the CAS number isn't necessarily in the same position in this list, making splitting by delimiter more difficult. Below is the source data example and the desired output I am trying to achieve.

enter image description here

An older answer to a post a while back used a Regex function to extract strings of Numbers with a given length.

fnRegexExtr

let   fx=(text,regex)=>
    Web.Page(
        "<script>
            var x='"&text&"';
            var y=new RegExp('"&regex&"','g');
            var b=x.match(y);
            document.write(b);
        </script>")[Data]{0}[Children]{0}[Children]{1}[Text]{0}

in
fx

Unsure if this is possible here and unfamiliar with Regex but I'm wondering if it is possible to modify this function to extract CAS numbers. The difficulty is that CAS Numbers can be in various formats CAS Numbers are up to 10 digits long using the format xxxxxxx-yy-z.

If anyone has any alternative solutions to extracting CAS numbers with this somewhat complex data feel free to post.

Data:

cid and cmpdname can be anything.

1-Aminopropan-2-ol|1-AMINO-2-PROPANOL|78-96-6|Isopropanolamine|Monoisopropanolamine
1-chloro-2,4-dinitrobenzene|2,4-Dinitrochlorobenzene|97-00-7|Dinitrochlorobenzene|DNCB|Chlorodinitrobenzene|CDNB
1,2-dichloroethane|Ethylene dichloride|107-06-2|Ethylene chloride|Ethane, 1,2-dichloro-|Glycol dichloride|Dutch liquid|Dutch oil|Ethane dichloride|Aethylenchloride
1,2,4-trichlorobenzene|120-82-1|Benzene, 1,2,4-trichloro-|unsym-Trichlorobenzene|Hostetex L-pec|Trojchlorobenzene
CHLOROACETALDEHYDE|2-chloroacetaldehyde|107-20-0|Chloroethanal|2-Chloroethanal|Acetaldehyde, chloro-|Chloroaldehyde|Monochloroacetaldehyde|2-Chloro-1-ethanal

Solution

  • In PQ, this will pull out the contents of any item that does not contain a letter in cmpdsynonym, which I think is basically what you are looking for

       let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
       #"Added Custom" = Table.AddColumn(Source, "Custom.3", each List.RemoveNulls(List.Transform(Text.Split([cmpdsynonym],"|"), each if _ = Text.Remove (_,{"A".."Z","a".."z"}) then _ else null)){0})
       in #"Added Custom"
    

    enter image description here