Search code examples
regexpowershellexport-to-csvregex-group

Parsing text file into csv. Issue splitting string that spans across multiple lines via PowerShell regex


I'm attempting to parse a txt file into a csv with 4 headers. However, I'm having some troubles successfully splitting/matching text between two strings that spans across multiple lines. I'm not very experienced with regex. Is there a way to do this without regex? Or does anyone know which combination of regex expressions I could use here?

I started with the following code but welcome any suggestions/recommendations if there's a better way to go about this:

$input = Get-Content ‘C:\Example.txt’
$output = ‘C:\Example.csv’
$regex = "'PRINT(.*?\n)PRINT'"
$regex2 = "'INSERT' -or $_ -eq 'UPDATE'"
$matches = [regex]::Matches($input, $regex)::multiline
$array = @()    

$matches | for each-object{
    $write-object = False
    $obj = New-Object System.Object
        If $_ -match $regex
    }
If ($writeobj){
    $obj | Add-Member -type NoteProperty -name row1 -value $row
    $array += $obj
}
$array | Export-csv -path $output

DESIRED OUTPUT:

SECTION UPDATE/INSERT TABLE CHANGE BLOCK
AA UPDATE Table_1 COL_1 = ‘123’ WHERE ID= ‘1’
BB INSERT Table_2 (DBO.COL_1) SELECT Col1Value AS [Col_1] FROM Table_1
CC UPDATE Table_1 COL_3
CC INSERT Table_2 (DBO.COL_3) SELECT Col3Value AS [Col_3] FROM Table_2

EXAMPLE .TXT CONTENT:

--AA***************************
--PRINT AA
UPDATE Table_1 
SET COL_1 = '123'
WHERE ID = '1'

--BB***************************
--PRINT BB
INSERT INTO [dbo].[Table_2] (DBO.COL_1)
SELECT Col1Value AS [Col_1] 
FROM  Table_1  

--CC***************************
--PRINT CC
UPDATE Table_1 
SET COL_3 = 'ABC'
WHERE ID = '3' 
AND ID2 = '1a' 

INSERT INTO [dbo].[Table_2] (DBO.COL_3)
SELECT Col3Value AS [Col_3] 
FROM  Table_2

Solution

  • My suggestion is to read the entire file in as a multi-line string using the -Raw switch parameter. Then you can split the file by sections (AA, BB, etc) and parse each section with regex. I do a bunch of -replace to strip out just the info I want for each property here.

    $input = Get-Content 'C:\Example.txt'
    $output = 'C:\Example.csv'
    $array = $input -split '(?ms)(?=--[^\*^\r^\n]+?\*{3,}[^\*])'|%{
        $Record = $_ -replace '(?ms)^--([^\*]+)\*.+$','$1'
        Switch -RegEx (($_ -split '(?ms)(?=UPDATE|INSERT)')){
            '^UPDATE' {[PSCustomObject]@{SECTION=$Record;'UPDATE/INSERT'='UPDATE';TABLE=$_ -replace '(?ms)UPDATE (.+?)[\r\n].*','$1';'CHANGE BLOCK'=$_ -replace 'UPDATE.+?[\r\n]+' -replace '[\r\n]+',' ' -replace '^SET '}}
            '^INSERT' {[PSCustomObject]@{SECTION=$Record;'UPDATE/INSERT'='INSERT';TABLE=$_ -replace '(?ms)INSERT INTO \S+?\[([^\]]+)].*','$1';'CHANGE BLOCK'=$_ -replace '(?ms)INSERT INTO \S+?\[([^\]]+)]\s*' -replace '[\r\n]+',' '}}
        }
    }
    $array | Export-csv -path $output
    

    That provided me with:

    SECTION UPDATE/INSERT TABLE    CHANGE BLOCK                                             
    ------- ------------- -----    ------------                                             
    AA      UPDATE        Table_1  COL_1 = '123' WHERE ID = '1'                             
    BB      INSERT        Table_2  (DBO.COL_1) SELECT Col1Value AS [Col_1]  FROM  Table_1   
    CC      UPDATE        Table_1  COL_3 = 'ABC' WHERE ID = '3'  AND ID2 = '1a'             
    CC      INSERT        Table_2  (DBO.COL_3) SELECT Col3Value AS [Col_3]  FROM  Table_2