I have these PDFs files and I want to extract the values in red and arrange them in a tabular form as shown below. I have attached the 3 sample PDFs in the link below. My problem is that sometimes the format of the PDFs add spaces between the tables and the DATE needs to be extracted and put in a column. The DATE is currently not in a table.
Is there a way I can combine the 3 sample PDFs in a single file using PowerQuery? Thanks for the anticipated help.
let
Source = Folder.Files("C:\Users\Dav\Downloads\test"),
#"Filtered Rows" = Table.SelectRows(Source, each [Extension] = ".pdf"),
#"Added Custom" = Table.AddColumn(#"Filtered Rows", "Custom", each Pdf.Tables([Content], [Implementation="1.3"])),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom.1", each [Custom]{[Id="Page001"]}[Data]),
#"Removed Other Columns" = Table.SelectColumns(#"Added Custom1",{"Custom.1"}),
#"Expanded Custom.1" = Table.ExpandTableColumn(#"Removed Other Columns", "Custom.1", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8"}),
#"Added Custom2" = Table.AddColumn(#"Expanded Custom.1", "Custom", each let a = Record.ToList(_),
b = if [Column1] = null then List.Transform(a, each try DateTime.FromText(_) otherwise null) else {},
c = List.First( List.RemoveNulls(b))
in c),
#"Filled Down" = Table.FillDown(#"Added Custom2",{"Custom"}),
#"Filtered Rows1" = Table.SelectRows(#"Filled Down", each [Column4] <> null),
#"Filtered Rows2" = Table.SelectRows(#"Filtered Rows1", each [Column1] <> "Data"),
#"Filtered Rows3" = Table.SelectRows(#"Filtered Rows2", each [Column1] <> "Data2"),
#"Removed Columns" = Table.RemoveColumns(#"Filtered Rows3",{"Column7", "Column8"}),
#"Renamed Columns" = Table.RenameColumns(#"Removed Columns",{{"Column1", "Data"}, {"Column2", "Test"}, {"Column3", "Test_YTD"}, {"Column4", "Test2"}, {"Column5", "Pai"}, {"Column6", "Temp"}, {"Custom", "Date"}})
in
#"Renamed Columns"