Search code examples
regexpowershellmultivalue

PowerShell - regex for changing delimiter in multi-value field


I have a tab delimited data set as such (some columns removed but not relevant here):

XYZ 12345671    18884   18912   222.00  222.00  Standard    2   CMD.1
ABC 12345672    18875   397.00  Standard    1   CMD.2
ABC 12345673    18889   18919   18950   146.66  146.66  146.68  Standard    3   CMD.2
XYZ 12345674    18840   18871   18901   193.00  193.00  192.00  Standard    3   CMD.1

Unfortunately in the data set there are two multi-value fields (with 1 or more sub-values) using the same delimiter as every other column. All of the 5 digit numbers starting with 18 are one field, then all of the decimal numbers are a second field. The number of sub-values will always match between these fields (that is, if the first has 2 values the second will have 2 values). The lonae number by itself between "Standard" and "CMD.X" is the number of sub-values in each of those fields. Basically, what I'm trying to create is:

XYZ 12345671    18884;18912 222.00;222.00   Standard    2   CMD.1
ABC 12345672    18875   397.00  Standard    1   CMD.2
ABC 12345673    18889;18919;18950   146.66;146.66;146.68    Standard    3   CMD.2
XYZ 12345674    18840;18871;18901   193.00;193.00;192.00    Standard    3   CMD.1

Due to the filesize, I am trying to use this with System.IO.File (as many as 100,000k records):

$File = [System.IO.File]::ReadAllText('in_file.txt')
    ForEach($RecordSet In $File){
        $RecordSet = $RecordSet -CReplace '(?m)(/d/d/d/d/d)`t(/d/d/d/d/d)','(/d/d/d/d/d);(/d/d/d/d/d)'
        $RecordSet = $RecordSet -CReplace '(?m)([0-9]*\.[0-9]+|[0-9]+)`t([0-9]*\.[0-9]+|[0-9]+)','([0-9]*\.[0-9]+|[0-9]+);([0-9]*\.[0-9]+|[0-9]+)'

    Add-Content ('out_file.txt") $RecordSet
}

Yes, I realize the in_file.txt is a stupid format -- it's not changeable unfortunately.


Solution

  • Method 1

    Maybe,

    (?<=\s\d{5}\b)\s+(?=\b\d{5}\b)|(?<=\s\d{3}\.\d{2}\b)\s+(?=\b\d{3}\.\d{2}\b)
    

    and a replacement of ; might be OK to look into.

    Demo 1

    Method 2

    or you can likely capture those with an expression similar to,

    (\d+)\s{3}(\d+)\s{3}(?:(\d+)\s{3})?(\d+\.\d+)\s{2}(\d+\.\d+)\s{2}(?:(\d+\.\d+)\s{2})?Standard
    

    then, replace with $1;$2;$3;$4;$5;$6; or $1;$2;$4;$5;.


    If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.