Search code examples
excelfull-text-searchpowerquerytext-extraction

Extracting data with an associated tag/unit


I have been attempting to separate out key data hidden within sentences of text e.g:

enter image description here

I have made some progress with the following code however it pulls undesired values too:

let
    Source = Excel.CurrentWorkbook(){[Name="Table3"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Input", type text}, {"Desired OutPut", type any}, {"Bonus", type text}}),
    #"Added Custom" = Table.AddColumn(#"Changed Type", "Custom", each if Text.Contains([Input], "mmHg") then Text.Remove([Input],Text.ToList(Text.Remove([Input],{"0".."9","-", " ", "."}))) else null),
    #"Trimmed Text" = Table.TransformColumns(#"Added Custom",{{"Custom", Text.Trim, type text}})
in
    #"Trimmed Text"

enter image description here

As you can see other numerical data is being pulled.

I think however following these rules is perhaps the wrong way to go about this and wonder If it's possible to use mmHg as a Tag to extract 'nearby` data. Ideally the value or range will be touching "mmHg" however there are instances where this isnt the case hence this idea of nearby logic. I apprecaite I could remove all data except numbers and mmgH however I think this idea of tagging if possible will be very useful going forward. In my mind im thinking like: if Text contains mmHg then search for {0..9,"-"} within X charecters (say 10 to the left). Is this possible?

As sort of extra I will attempt to extract the Eye that this pressure is found in. Here I wish to use some soft of logic with a sort of first come first serve basis. I think this it an okay assumption that the first pressure will relate to the first mentioned eye per sentence. I am unsure how to do this in M code. This may however warrant a seperate question.


Solution

  • I think you can utilize regular expressions here:


    Step 1):

    Add a custom function to the group of your table:

    enter image description here

    In this case I called it 'fnRegexExtr' (much like a previous question you asked). The source function I used came from here and is a regex-replace function.

    (x,y,z)=>
    let 
       Source = Web.Page(
                         "<script>var x="&"'"&x&"'"&";var z="&"'"&z&
                         "'"&";var y=new RegExp('"&y&"','g');
                         var b=x.replace(y,z);document.write(b);</script>")
                         [Data]{0}[Children]{0}[Children]{1}[Text]{0}
    in 
       Source
    

    Step 2):

    On the 'Add Column' tab, invoke this custom function. Use the following parameters:


    Step 3):

    We can add another column using the same function with different parameters:

    Please note the trailing spaces. Using spaces inbetween capture group 1 makes that PQ will auto-trim the result.


    Step 4):

    Under tab 'Transform' I simply replaced errors with 'null' values.


    Step 5):

    Edited the M-code to replace spaces inbetween values with comma-space delimiters.


    Result:

    enter image description here


    M-Code:

    let
        Source = Excel.CurrentWorkbook(){[Name="Tabel1_2"]}[Content],
        #"Changed Type" = Table.TransformColumnTypes(Source,{{"Input", type text}}),
        #"Invoked Custom Function" = Table.AddColumn(#"Changed Type", "mmHg", each Text.Replace(fnRegexExtr([Input], "(\\d+(?:-\\d+)?)\\D*mmHg|.", "$1 ")," ",", ")),
        #"Invoked Custom Function1" = Table.AddColumn(#"Invoked Custom Function", "Side", each Text.Replace(fnRegexExtr([Input], "\\b(right|left)\\s*eye\\b|.", "$1 ")," ",", ")),
        #"Replaced Errors" = Table.ReplaceErrorValues(#"Invoked Custom Function1", {{"mmHg", null}, {"Side", null}})
    in
        #"Replaced Errors"