Search code examples
excelsplitoffice365powerquerydelimiter

Conditional Split if Delimiter is missing


Background: The table below contains example data from relevant documents (the names of documents are MSDS, RMQ, COA, Technical Data sheet) for materials A,B,C etc. The information from these documents includes the date, Heavy Metal impurities and Residual Solvent impurities and their amount in ppm (parts per million).

Using power query I have sorted this data so that the 2 tables shown in the image below are produced.

enter image description here

These containing the highest amounts of heavy metals (blue) and residual solvents (yellow) found across the documentation as well as source of the document containing this value. To replicate the spreadsheet I have provided the (quite extensive) M code at the bottom. Very Briefly though for this problem; "Heavy Metals" and "Residual Solvents" are phrases used as delimiters to split the data accordingly.

Minor Problem: Although pleased with how the table functions, I didn't feel that the 'splitting of a split column' (see M code) is an entirely satisfactory solution to separate the data. Subsequently I've realised that If a cell were to accidently not include "Heavy Metals" as a delimiter the logic would cause the Residual Solvent data for this cell to be lost (as is the case for Cell 4E (Material C, technical data Sheet)).

I may just insist to those using this spreadsheet to ensure these phrases are always present however I wanted to ask here to see if anyone had any clever alternatives to the M Code provided so that although the Heavy Metals may be missing without the delimiter (or if spelled incorreclty), the Residual Solvents are still pulled through.

I appreciate that this is rather a tasking job for someone to look at, and fortunately it is a relatively minor issue so any advice would just be a bonus. I also just through it was quite quite interesting to show how power query can be used to split seemingly complex data within a cell. Also please note that the Data in the table is 'messy' to test if this causes any problems.

M Code: This is the Code for just the Residual Solvents Table. Power query splits the data into heavy Metals and Residual Solvents and then depending on the table removes the appropriate columns.

    let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Material", type text}, {"MSDS", type text}, {"RMQ", type text}, {"COA", type text}, {"Technical Data Sheet", type text}}),
    
    //Clean Up user Input (removed additional Spaces)
    #"Trimmed Text" = Table.TransformColumns(#"Changed Type",{{"Material", Text.Trim, type text}, {"MSDS", Text.Trim, type text}, {"RMQ", Text.Trim, type text}, {"COA", Text.Trim, type text}, {"Technical Data Sheet", Text.Trim, type text}}),
   
   //Split Data into Date, Heavy Metals and Residual Solvents
   //MSDS 
    #"Split MSDS by Heavy Metals" = Table.SplitColumn(#"Trimmed Text", "MSDS", Splitter.SplitTextByDelimiter("Heavy Metals", QuoteStyle.Csv), {"MSDS Date", "MSDS Heavy Metals"}),
    #"Split MSDS by Residual Solvents" = Table.SplitColumn(#"Split MSDS by Heavy Metals", "MSDS Heavy Metals", Splitter.SplitTextByDelimiter("Residual Solvents", QuoteStyle.Csv), {"MSDS Heavy Metals", "MSDS Residual Solvents"}),
    
    // RMQ
    #"Split RMQ by Heavy Metals" = Table.SplitColumn(#"Split MSDS by Residual Solvents", "RMQ", Splitter.SplitTextByDelimiter("Heavy Metals", QuoteStyle.Csv), {"RMQ Date", "RMQ Heavy Metals"}),
    #"Split Column by Residual Solvents" = Table.SplitColumn(#"Split RMQ by Heavy Metals", "RMQ Heavy Metals", Splitter.SplitTextByDelimiter("Residual Solvents", QuoteStyle.Csv), {"RMQ Heavy Metals", "RMQ Residual Solvents"}),
    
    //COA
    #"Split COA by Heavy Metals" = Table.SplitColumn(#"Split Column by Residual Solvents", "COA", Splitter.SplitTextByDelimiter("Heavy Metals", QuoteStyle.Csv), {"COA Date", "COA Heavy Metals"}),
    #"Split COA by Residual Solvents" = Table.SplitColumn(#"Split COA by Heavy Metals", "COA Heavy Metals", Splitter.SplitTextByDelimiter("Residual Solvents", QuoteStyle.Csv), {"COA Heavy Metals", "COA Residual Solvents"}),
    
    //Technical Data Sheet
    #"Split Technical Data Sheet by Heavy Metals" = Table.SplitColumn(#"Split COA by Residual Solvents", "Technical Data Sheet", Splitter.SplitTextByDelimiter("Heavy Metals", QuoteStyle.Csv), {"Technical Data Sheet Date", "Technical Data Sheet Heavy Metals"}),
    #"Split Technical Data Sheet by Residual Solvents" = Table.SplitColumn(#"Split Technical Data Sheet by Heavy Metals", "Technical Data Sheet Heavy Metals", Splitter.SplitTextByDelimiter("Residual Solvents", QuoteStyle.Csv), {"Technical Data Sheet Heavy Metals", "Technical Data Sheet Residual Solvents"}),
    
    //Changes Data to date type
    #"Changed Type1" = Table.TransformColumnTypes(#"Split Technical Data Sheet by Residual Solvents",{{"MSDS Date", type date}, {"RMQ Date", type date}, {"COA Date", type date}, {"Technical Data Sheet Date", type date}}),
   
   //Remove Date (add // to add date back into data)
    #"Removed Date" = Table.RemoveColumns(#"Changed Type1",{"MSDS Date", "RMQ Date", "COA Date", "Technical Data Sheet Date"}),
    
    //Clean up unnecessary line breaks
    #"Trimmed Text1" = Table.TransformColumns(#"Removed Date",{{"Material", Text.Trim, type text}, {"MSDS Heavy Metals", Text.Trim, type text}, {"MSDS Residual Solvents", Text.Trim, type text}, {"RMQ Heavy Metals", Text.Trim, type text}, {"RMQ Residual Solvents", Text.Trim, type text}, {"COA Heavy Metals", Text.Trim, type text}, {"COA Residual Solvents", Text.Trim, type text}, {"Technical Data Sheet Heavy Metals", Text.Trim, type text}, {"Technical Data Sheet Residual Solvents", Text.Trim, type text}}),
    #"Removed Columns" = Table.RemoveColumns(#"Trimmed Text1",{"MSDS Heavy Metals", "RMQ Heavy Metals", "COA Heavy Metals", "Technical Data Sheet Heavy Metals"}),
    
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns",{{"MSDS Residual Solvents", "MSDS"}, {"RMQ Residual Solvents", "RMQ"}, {"COA Residual Solvents", "COA"}, {"Technical Data Sheet Residual Solvents", "Technical Data Sheet"}}),
    
    //Unpivot data into columns, split and clean up as necessary necessary
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Renamed Columns", {"Material"}, "Source", "Amount"),
    
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Unpivoted Other Columns", {{"Amount", Splitter.SplitTextByDelimiter("#(lf)", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Amount"),
    #"Changed Type2" = Table.TransformColumnTypes(#"Split Column by Delimiter",{{"Amount", type text}}),
    #"Trimmed Text2" = Table.TransformColumns(#"Changed Type2",{{"Amount", Text.Trim, type text}}),
    #"Split Column by Delimiter1" = Table.SplitColumn(#"Trimmed Text2", "Amount", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, true), {"Heavy Metal", "Amount"}),
    #"Changed Type3" = Table.TransformColumnTypes(#"Split Column by Delimiter1",{{"Heavy Metal", type text}, {"Amount", Int64.Type}}),

 // if date is to be included add this line 
        //{"Date", (t) => t[Date]{List.PositionOf(t[Amount],List.Max(t[Amount]))}, type date}


    //Group rows by Material and Metal
//Extract the highest amount and corresponding source
    #"Grouped Rows" = Table.Group(#"Changed Type3", {"Material", "Heavy Metal"}, {
        {"Amount", each List.Max([Amount]),type number},
        {"Source", (t) => t[Source]{List.PositionOf(t[Amount],List.Max(t[Amount]))}, type text}
       
        })
in
    #"Grouped Rows"

Link to file:

https://1drv.ms/u/s!AsrLaUgt0KCLhXtP-jYDd4Z0ujKQ?e=Ba8Htx


Solution

  • I would do the splitting between solvent and metals differently, so that it doesn't matter if one category, or the other, is missing.

    If there might be misspellings of Residual Solvents or Heavy Metals, you could even do some fuzzy matching instead of equality as I have in the code.

    • Unpivot other than the Material column to create three columns
    • Split the Value column by line feed into rows
    • Trim the Value column, then filter out the blanks
    • Add a custom column based on the Value column, copying over only anything that is a date, or the string Heavy Metals or Residual Solvents
    • Fill down so every row has an entry
    • Filter out the dates (by selecting just the Metals and Solvents entries).
    • Filter the Value and Custom columns (see notes in the code)
    • Split the Value column between the substance and the amount
    • This will leave you with a table of five columns
      • You can filter the fifth column for either Metals or Solvents
    • Then group by Material and extract what you want

    enter image description here

    M Code (for the solvents table)

    let
        Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
        #"Changed Type" = Table.TransformColumnTypes(Source,{{"Material", type text}, {"MSDS", type text}, {"RMQ", type text}, {"COA", type text}, {"Technical Data Sheet", type text}}),
        
        //Unpivot to develop a single column of solvent/metals/date data
        #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Changed Type", {"Material"}, "Attribute", "Value"),
    
        //split into rows by line feed
        #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Unpivoted Other Columns", 
            {{"Value", Splitter.SplitTextByDelimiter("#(lf)", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Value"),
        #"Trimmed Text" = Table.TransformColumns(#"Split Column by Delimiter",{{"Value", Text.Trim, type text}}),
    
        //filter out the blank rows
        #"Filtered Rows" = Table.SelectRows(#"Trimmed Text", each ([Value] <> "")),
    
        //Add custom column for separating the tables
        #"Added Custom" = Table.AddColumn(#"Filtered Rows", "Custom", each try Date.FromText([Value]) otherwise 
            if [Value] = "Heavy Metals" or [Value] = "Residual Solvents" then [Value] else null),
        #"Changed Type1" = Table.TransformColumnTypes(#"Added Custom",{{"Custom", type text}}),
        #"Filled Down" = Table.FillDown(#"Changed Type1",{"Custom"}),
    
        //Filter the value and custom columns to remove contaminant type from Value column and remove dates from Custom column
        #"Filtered Rows1" = Table.SelectRows(#"Filled Down", 
            each ([Custom] = "Heavy Metals" or [Custom] = "Residual Solvents") and ([Value] <> "Heavy Metals" and [Value] <> "Residual Solvents")),
    
        //split substance from amount
        #"Split Column by Delimiter1" = Table.SplitColumn(#"Filtered Rows1", "Value", 
            Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, true), {"Substance", "Amount"}),
        #"Changed Type2" = Table.TransformColumnTypes(#"Split Column by Delimiter1",{{"Substance", type text}, {"Amount", Int64.Type}}),
    
        //Filter for Solvents Table
        #"Filtered Rows2" = Table.SelectRows(#"Changed Type2", each ([Custom] = "Residual Solvents")),
    
        //Groub by Material and Substance, then extract the Max contaminant and Source
        #"Grouped Rows" = Table.Group(#"Filtered Rows2", {"Material", "Substance"}, {
            {"Amount", each List.Max([Amount]), type number},
            {"Source", (t) => t[Attribute]{List.PositionOf(t[Amount],List.Max(t[Amount]))}, type text}
            })
    in
        #"Grouped Rows"
    

    enter image description here

    Note

    For a learning experience, I decided to tackle the problem of dealing with the same Max amount being cited from two or more sources.

    I changed the Source extract line of the Table.Group function to return these as a semi-colon separated string:

        //Group by Material and Substance
        //Extract Max Amount and Source (or multiple Sources if max amount identical)
        #"Grouped Rows" = Table.Group(#"Filtered Rows2", {"Material", "Substance"}, {
            //{"All", each _, type table [Material=nullable text, Attribute=text, Substance=nullable text, Amount=nullable number, Custom=nullable text]},        
            {"Amount", each List.Max([Amount]), type number},
            //{"Source", (t) => t[Attribute]{List.PositionOf(t[Amount],List.Max(t[Amount]))}, type text},
            {"Source", (t)=> Text.Combine(
                List.Generate(() => [Counter=1, IDX = List.PositionOf(t[Amount],List.Max(t[Amount]))],
                     each [Counter] <= List.Count(List.PositionOf(t[Amount],List.Max(t[Amount]),4)),
                     each [Counter = [Counter]+1, IDX = List.PositionOf(t[Amount],List.Max(t[Amount]),4){[Counter]}],
                     each t[Attribute]{[IDX]}),"; ")}
            })
    

    Using List.Generate in this way was guided adapted from somewhat different problem addressed in Chris Webb's BI Blog