Search code examples
excelsplitextractpowerquerytext-extraction

Extract numbers from text by minimum length of number string using power query


Problem I've been tasked to tidy up some very messy data containing a mix of text and numbers and wish to use power query to separate codes from the data. Fortunately, the codes that need separating consist of only numerical values and appear to be 7 characters in length (let's say 6 or greater).

Below is an example of how I wish to separate the data:

enter image description here

So Far: So Far I have this code:

let
Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Input ", type text}}),
#"Replaced Value" = Table.ReplaceValue(#"Changed Type","_"," ",Replacer.ReplaceText,{"Input "}),
#"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","v"," ",Replacer.ReplaceText,{"Input "}),
#"Added Custom" = Table.AddColumn(#"Replaced Value1", "TextSplit", each Text.Split([#"Input "], " ")),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "ListTransform", each List.Transform([TextSplit], each Text.Select(_,{"0".."9"}))),
#"Added Custom2" = Table.AddColumn(#"Added Custom1", "ListSelect", each List.Select([ListTransform], each Text.Length(_)>=5)),
#"Added Custom3" = Table.AddColumn(#"Added Custom2", "TextCombine", each Text.Combine([ListSelect], ", ")),
#"Removed Columns" = Table.RemoveColumns(#"Added Custom3",{"TextSplit", "ListTransform", "ListSelect"})

in #"Removed Columns"

which does appear to solve the problem. However, in cases like 0102646v2.0 this is pulled through as 010264620. In order to get it to work, I have had to introduce steps to replace _ and "v" with " ". Is it not possible for Power Query to recognize that say 0102646v2.0 should be extracted as 0102646?

Data:

Input Values:
3159087 v1.0
3194070 v1.0
#8102368 V3.0 (Shine and ProtectR18)
#8102371 V4.0 (Lemon 12A Degreaser)
Marine (FF3080300 v1.0)
Green Apple (FF3080301 v1.0)
0102646v2.0 (Fresh Cotton)
TDS# 3129801 V1.0 GPA Code#3123402
FF3112964 0.1 FF3145524 0.1_3152912 0.1

Thank you!

Update: Pulling Version Number

enter image description here

Regex


Solution

  • Here's a PQ implementation of extracting patterns using Regex and returning them comma separated:

    Add this as a custom function. I named it fnRegexExtr

    //see http://www.thebiccountant.com/2018/04/25/regex-in-power-bi-and-power-query-in-excel-with-java-script/
    // and https://gist.github.com/Hugoberry/4948d96b45d6799c47b4b9fa1b08eadf
    
    let   fx=(text,regex)=>
        Web.Page(
            "<script>
                var x='"&text&"';
                var y=new RegExp('"&regex&"','g');
                var b=x.match(y);
                document.write(b);
            </script>")[Data]{0}[Children]{0}[Children]{1}[Text]{0}
    
    in
    fx
    

    You can then use it in your code like this:

    let
        Source = Excel.CurrentWorkbook(){[Name="Table10"]}[Content],
        #"Changed Type" = Table.TransformColumnTypes(Source,{{"Input", type text}}),
        #"Added Custom" = Table.AddColumn(#"Changed Type", "Output", 
            each fnRegexExtr([Input], "[0-9]{6,}"))
    in
        #"Added Custom"
    

    which returns:

    enter image description here