Search code examples
excelpowerbipowerquerypowerbi-desktopm

Combining PDFS of slightly different formats using Power Query


I have these PDFs files and I want to extract the values in red and arrange them in a tabular form as shown below. I have attached the 3 sample PDFs in the link below. My problem is that sometimes the format of the PDFs add spaces between the tables and the DATE needs to be extracted and put in a column. The DATE is currently not in a table.

Is there a way I can combine the 3 sample PDFs in a single file using PowerQuery? Thanks for the anticipated help.

https://gofile.io/d/iNtyNc

enter image description here


Solution

  • enter image description here

    let
        Source = Folder.Files("C:\Users\Dav\Downloads\test"),
        #"Filtered Rows" = Table.SelectRows(Source, each [Extension] = ".pdf"),
        #"Added Custom" = Table.AddColumn(#"Filtered Rows", "Custom", each Pdf.Tables([Content], [Implementation="1.3"])),
        #"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom.1", each [Custom]{[Id="Page001"]}[Data]),
        #"Removed Other Columns" = Table.SelectColumns(#"Added Custom1",{"Custom.1"}),
        #"Expanded Custom.1" = Table.ExpandTableColumn(#"Removed Other Columns", "Custom.1", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8"}),
        #"Added Custom2" = Table.AddColumn(#"Expanded Custom.1", "Custom", each let a = Record.ToList(_),
    b = if [Column1] = null then List.Transform(a, each try DateTime.FromText(_) otherwise null) else {},
    c = List.First( List.RemoveNulls(b))
    in c),
        #"Filled Down" = Table.FillDown(#"Added Custom2",{"Custom"}),
        #"Filtered Rows1" = Table.SelectRows(#"Filled Down", each [Column4] <> null),
        #"Filtered Rows2" = Table.SelectRows(#"Filtered Rows1", each [Column1] <> "Data"),
        #"Filtered Rows3" = Table.SelectRows(#"Filtered Rows2", each [Column1] <> "Data2"),
        #"Removed Columns" = Table.RemoveColumns(#"Filtered Rows3",{"Column7", "Column8"}),
        #"Renamed Columns" = Table.RenameColumns(#"Removed Columns",{{"Column1", "Data"}, {"Column2", "Test"}, {"Column3", "Test_YTD"}, {"Column4", "Test2"}, {"Column5", "Pai"}, {"Column6", "Temp"}, {"Custom", "Date"}})
    in
        #"Renamed Columns"