Search code examples
excelpowerbipowerquerym

Very tricky Power Query text extraction


I have two columns below where there are multiple “tenants” in one column and multiple “areas” in another. I want to extract all the “SA Majeste” and “PWGSC” entities and their corresponding areas such that my desired output is:

Tenant Areas
SA MAJESTE LA REINE (81001068) PWGSC 6,776
SA MAJESTE LA REINE (81001068) PWGSC 6,174
PWGSC (81000921) 16,964
PWGSC (81000732) 9,438
SA MAJESTE LA REINE ( 81001133) 6,075

Based on this source table

Tenant Areas
THE CHEMICAL INSTITUTE OF CANADA - [10] RESIDENT DOCTORS OF Vacant SA MAJESTE LA REINE (81001068) PWGSC ^ 0400D(3,643.00)-08/31/2025 0401B(7,453.00)-11/30/2023 0402(2,178.00)- CANADA - [11] 11/30/2026 0450(6,776.00) ^
CRESTVIEW STRATEGY INC. SA MAJESTE LA REINE (81001068) PWGSC Vacant - [12] [13] ^ 1100(4,748.00)-06/30/2030 1101(6,174.00)-11/30/2023 1105(1,715.00) ^
PWGSC (81000921) KELLY SANTINI MANAGEMENT CORP. KELLY SANTINI [14] ^ 2401(16,964.00)-09/30/2026 2402(14,535.00)-04/30/2026 2403(4,482.00)- ^
PWGSC (81000732) ^ BELL CANADA (Office) 1401(9,438.00)-02/29/2032 1402(29,711.00)-06/30/2030 ^
Vacant SA MAJESTE LA REINE ( 81001133) ^ 3000(11,925.00) 3001(6,075.00)-10/31/2024 ^

As can be seen, the SA Majeste or PWGSC entities are not always in second place. They could be in third, fourth or nth place. Thus, there is a pattern for extraction, but I had no luck figuring out the M formula for this extraction.

Unfortunately, all the data comes from a PDF online, so I have no control over how the data are structured…I’d have normalized the data to make the table analysis-friendly otherwise. Please help?


Solution

  • The following produces something "close" to what you show.

    • Split the Tenant column based on either the - or the word Vacant

    • Collect the Areas by looking for a sequence of digits enclosed in parentheses, that may or may not start with space.

    • Correlate the SA MAJ.. string or the PWGSC string location with the same area location and return that area.

    • Problem is that does not work as there does not seem to be a definitive, inclusive rule to also split CRESTVIEW STRATEGY INC. SA MAJESTE LA REINE into the two parts. I don't know if the string INC. would work, but that could be added if it does.

    Custom Function
    Add below function as Custom Query (from blank) and rename it fnRegexExtr

    //see http://www.thebiccountant.com/2018/04/25/regex-in-power-bi-and-power-query-in-excel-with-java-script/
    // and https://gist.github.com/Hugoberry/4948d96b45d6799c47b4b9fa1b08eadf
    
    let   
        fx=(text,regex)=>
        Web.Page(
            "<script>
                var x='"&text&"';
                var y=new RegExp('"&regex&"','g');
                var b=x.match(y);
                document.write(b);
            </script>")[Data]{0}[Children]{0}[Children]{1}[Text]{0}
    
    in
        fx
    

    Main Code

    let
    
    //Change next line to reflect actual data source
        Source = Excel.CurrentWorkbook(){[Name="Table23"]}[Content],
        #"Changed Type" = Table.TransformColumnTypes(Source,{{"Tenant", type text}, {"Areas", type text}}),
    
    //Extract relevant Tenant
        #"Added Custom" = Table.AddColumn(#"Changed Type", "Tenants", each Text.Replace(
                fnRegexExtr([Tenant], "(SA MAJESTE.*?(?=\\s*\\())|(\\(\\s*\\d+\\))|(PWGSC)"),","," "), type text),
    
    //Extract relevant area
        #"Added Custom1" = Table.AddColumn(#"Added Custom", "Area", each 
            let 
                normalizeVacant = Text.Replace([Tenant],"Vacant","-x-"),
                split = List.RemoveItems(Text.Split(normalizeVacant,"-"),{""}),
                areas = List.Transform(
                            Text.Split(
                                fnRegexExtr([Areas],"\\([^)]+\\)"),"),"), 
                            each Number.From(Text.Remove(_,{")","("}))),
                tenant = List.Select(split, each Text.Contains(_,"SA MAJESTE") or Text.Contains(_,"PWGSC")),
                POStenant = List.PositionOf(split, tenant{0}),
                area =  areas{POStenant}
                
            in 
              area, Int64.Type),
        
        #"Removed Columns" = Table.RemoveColumns(#"Added Custom1",{"Tenant", "Areas"})
    in
        #"Removed Columns"
    

    Result from above
    enter image description here

    The following code shows exactly what you show in your example. It arbitrarily decides to add one (1) to the area selection if the split that contains the SA MAJ... is not at the beginning of the text string.

    This does not seem very robust to me, but maybe it will be good enough.

    let
    
    //Change next line to reflect actual data source
        Source = Excel.CurrentWorkbook(){[Name="Table23"]}[Content],
        #"Changed Type" = Table.TransformColumnTypes(Source,{{"Tenant", type text}, {"Areas", type text}}),
    
    //Extract relevant Tenant
        #"Added Custom" = Table.AddColumn(#"Changed Type", "Tenants", each Text.Replace(
                fnRegexExtr([Tenant], "(SA MAJESTE.*?(?=\\s*\\())|(\\(\\s*\\d+\\))|(PWGSC)"),","," "), type text),
    
    //Extract relevant area
        #"Added Custom1" = Table.AddColumn(#"Added Custom", "Area", each 
            let 
                normalizeVacant = Text.Replace([Tenant],"Vacant","-x-"),
                split = List.RemoveItems(Text.Split(normalizeVacant,"-"),{""}),
                areas = List.Transform(
                            Text.Split(
                                fnRegexExtr([Areas],"\\([^)]+\\)"),"),"), 
                            each Number.From(Text.Remove(_,{")","("}))),
                tenant = List.Select(split, each Text.Contains(_,"SA MAJESTE") or Text.Contains(_,"PWGSC")){0},
                POS = List.PositionOf(split, tenant),
                POStenant = if Text.StartsWith(Text.Trim(tenant),"SA MAJ") or Text.StartsWith(Text.Trim(tenant),"PWGSC")
                                then POS else POS+1 ,
                area =  areas{POStenant}
                
            in 
                area, Int64.Type),
        
        #"Removed Columns" = Table.RemoveColumns(#"Added Custom1",{"Tenant", "Areas"})
    in
        #"Removed Columns"
    

    enter image description here