Search code examples
powershellcsvimport-csvexport-csv

Merge rows and split content from one .csv to multiple files using powershell


As mentioned in https://stackoverflow.com/a/66565662/14210760 I'd like to have a second type of output for the given data:

header1; header2; header3; header4; header5; header6; header7; header8; header9; header10; header11; header12; header13;
AB; 12345; AB123456789; 10.03.2021; GT; BC987654321; EUR
CD; 456789; 22.24; Text; SW;
AB; 12345; AB123456789; 10.03.2021; GT; BC987654322; EUR
CD; 354345; 85.45; Text; SW;
CD; 123556; 94.63; Text; SW;
CD; 354564; 12.34; Text; SW;
CD; 135344; 32.23; Text; SW;
AB; 12345; AB123456789; 10.03.2021; GT; BC987654323; EUR
CD; 354564; 12.34; Text; SW;
CD; 852143; 34.97; Text; SW;

This time the AB rows should always be in front of the CD rows. I know it's redundand but it'll make every single row a whole set of data. The desired outcome would be: BC987654321.csv

header1; header2; header3; header4; header5; header6; header7; header8; header9; header10; header11; header12; header13;
AB; 12345; AB123456789; 10.03.2021; GT; BC987654321; EUR; 12345; CD; 456789; 22.24; Text; SW;

BC987654322.csv

header1; header2; header3; header4; header5; header6; header7; header8; header9; header10; header11; header12; header13;
AB; 12345; AB123456789; 10.03.2021; GT; BC987654322; EUR; 12345; CD; 354345; 85.45; Text; SW;
AB; 12345; AB123456789; 10.03.2021; GT; BC987654322; EUR; 12345; CD; 123556; 94.63; Text; SW;
AB; 12345; AB123456789; 10.03.2021; GT; BC987654322; EUR; 12345; CD; 354564; 12.34; Text; SW;
AB; 12345; AB123456789; 10.03.2021; GT; BC987654322; EUR; 12345; CD; 135344; 32.23; Text; SW;

BC987654323.csv

header1; header2; header3; header4; header5; header6; header7; header8; header9; header10; header11; header12; header13;
AB; 12345; AB123456789; 10.03.2021; GT; BC987654323; EUR; 12345; CD; 354564; 12.34; Text; SW;
AB; 12345; AB123456789; 10.03.2021; GT; BC987654323; EUR; 12345; CD; 852143; 34.97; Text; SW;

Thank you in advance


Solution

  • For this we need to be more creative and use temporary Hashtables.

    Something like this:

    $path = 'D:\Test'
    $fileIn = Join-Path -Path $path -ChildPath 'input.csv'
    $fileOut = $null   # will get a value in the loop
    $splitValue = 'AB' # the header1 value that decides to start a new file
    $csv = Import-Csv -Path $fileIn -Delimiter ';'
    # get an array of the column headers
    $allHeaders = $csv[0].PsObject.Properties.Name
    
    ## create a new variable containing
    
    $hash = [ordered]@{}
    foreach ($item in $csv) {
        if ($item.header1 -eq $splitValue) { 
            # start a new row (build a new hash)
            $hash.Clear()
            $item.PsObject.Properties | Where-Object { $_.Value } | ForEach-Object { $hash[$_.Name] = $_.Value } 
            # get the filename from header6
            $fileOut = Join-Path -Path $path -ChildPath ('{0}.csv' -f $item.header6)
            # if a file with that name already exists, delete it
            if (Test-Path -Path $fileOut -PathType Leaf) { Remove-Item -Path $fileOut }
        }
        elseif ($hash.Count) {
            # copy the hash which holds the beginning of the line to a temporary row hash (the 'AB' line)
            $rowHash = [ordered]@{}
            foreach ($name in $hash.Keys) { $rowHash[$name] = $hash[$name] }
            $headerIndex = $hash.Count
            # append the new fields from this line to the row hash
            $item.PsObject.Properties | Where-Object { $_.Value } | ForEach-Object {
                # for safety: test if we do not index out of the $allHeaders array
                $header = if ($headerIndex -lt $allHeaders.Count) { $allHeaders[$headerIndex] } else { "header$($headerIndex + 1)" }
                $rowHash[$header] = $_.Value 
                $headerIndex++  # increment the counter
            }
            # append trailing headers with empty value
            while ($headerIndex -lt $allHeaders.Count) { 
                $rowHash[$allHeaders[$headerIndex++]] = $null
            }
            # cast the finalized rowhash into a [PsCustomObject]
            $newRow = [PsCustomObject]$rowHash
            # write the completed row in the csv file
            ##$fileOut = Join-Path -Path $path -ChildPath ('{0}.csv' -f $newRow.header6)
            # if the file already exists, we append, otherwise we create a new file
            $append = Test-Path -Path $fileOut -PathType Leaf
            $newRow | Export-Csv -Path $fileOut -Delimiter ';' -NoTypeInformation -Append:$append
        }
        else {
            Write-Warning "Could not find a starting row (header1 = '$splitValue') for the file"
        }
     }
    

    Output:

    BC987654321.csv

    "header1";"header2";"header3";"header4";"header5";"header6";"header7";"header8";"header9";"header10";"header11";"header12";"header13"
    "AB";"12345";"AB123456789";"10.03.2021";"GT";"BC987654321";"EUR";"CD";"456789";"22.24";"Text";"SW";
    

    BC987654322.csv.csv

    "header1";"header2";"header3";"header4";"header5";"header6";"header7";"header8";"header9";"header10";"header11";"header12";"header13"
    "AB";"12345";"AB123456789";"10.03.2021";"GT";"BC987654322";"EUR";"CD";"354345";"85.45";"Text";"SW";
    "AB";"12345";"AB123456789";"10.03.2021";"GT";"BC987654322";"EUR";"CD";"123556";"94.63";"Text";"SW";
    "AB";"12345";"AB123456789";"10.03.2021";"GT";"BC987654322";"EUR";"CD";"354564";"12.34";"Text";"SW";
    "AB";"12345";"AB123456789";"10.03.2021";"GT";"BC987654322";"EUR";"CD";"135344";"32.23";"Text";"SW";
    

    BC987654323.csv.csv

    "header1";"header2";"header3";"header4";"header5";"header6";"header7";"header8";"header9";"header10";"header11";"header12";"header13"
    "AB";"12345";"AB123456789";"10.03.2021";"GT";"BC987654323";"EUR";"CD";"354564";"12.34";"Text";"SW";
    "AB";"12345";"AB123456789";"10.03.2021";"GT";"BC987654323";"EUR";"CD";"852143";"34.97";"Text";"SW";
    

    EDIT

    The above works on the sample data given in the question, but relies heavily on the fact that no fields that matter can be empty.

    As you have commented, the real csv does have empty fields and because of that, the code shifts the data into the wrong columns where that happens.

    Using your real data, this should do a lot better:

    $path       = 'D:\Test'
    $fileIn     = Join-Path -Path $path -ChildPath 'input.csv'
    $fileOut    = $null   # will get a value in the loop
    $splitValue = 'IH'    # the value in the first column ($idColumn) that decides to start a new file. (in example data 'AB')
    $csv        = Import-Csv -Path $fileIn -Delimiter ';'
    
    # get an array of all the column headers
    $allHeaders = $csv[0].PsObject.Properties.Name   # a string array of all header names
    # get the index of the first column to start appending from ("Identifier")
    $idColumn   = $allHeaders[0]                     # --> 'Record Identifier'  (in example data 'header1')
    
    $mergeIndex = [array]::IndexOf($allHeaders, "Identifier")  # this is Case-Sensitive !
    # if you want to do this case-insensitive, you need to do something like
    # $mergeIndex = [array]::IndexOf((($allHeaders -join ';').ToLowerInvariant() -split ';'), "identifier")
    
    # create an ordered hash that will contain the values up to column no. $mergeIndex
    $hash = [ordered]@{}
    foreach ($item in $csv) {
        if ($item.$idColumn -eq $splitValue) { 
            # start a new row (build a new hash)
            $hash.Clear()
            for ($i = 0; $i -lt $mergeIndex; $i++) {
                $hash[$allHeaders[$i]] = $item.$($allHeaders[$i])  # we need $(..) because of the spaces in the header names
            }
    
            # get the filename from the 6th header $item.$($allHeaders[5]) --> 'VAT Number'
            $fileOut = Join-Path -Path $path -ChildPath ('{0}.csv' -f $item.'VAT Number')
            # if a file with that name already exists, delete it
            if (Test-Path -Path $fileOut -PathType Leaf) { Remove-Item -Path $fileOut }
        }
        elseif ($hash.Count) {
            # create a new ordered hashtable to build the entire line with
            $rowHash = [ordered]@{}
            # copy the hash which holds the beginning of the line to a temporary row hash (the 'IH' line)
            # an ordered hashtable does not have a .Clone() method unfortunately..
            foreach ($name in $hash.Keys) { $rowHash[$name] = $hash[$name] }
    
            # append the fields from this item to the row hash starting at the $mergeIndex column
            $j = 0
            for ($i = $mergeIndex; $i -lt $allHeaders.Count; $i++) {
                $rowHash[$allHeaders[$i]] = $item.PsObject.Properties.Value[$j++]
            }
    
            # cast the finalized rowhash into a [PsCustomObject] and add to the file
            [PsCustomObject]$rowHash | Export-Csv -Path $fileOut -Delimiter ';' -NoTypeInformation -Append
        }
        else {
            Write-Warning "Could not find a starting row ('$idColumn' = '$splitValue') for the file"
        }
     }
    

    Note I do not show the output here because perhaps the real csv shows sensitive data