powershell logging awk text

How to remove all CRLFs at position 51 in a file, processing the file in reverse order


I have a large text file that got word-wrapped at 50 characters by the application that outputs it.

The original, unwrapped line length varies wildly, from 1 character up to 1500+ characters.

I need something that can process the file in reverse order (starting from the bottom) and remove every CRLF that sits at position 51, but leave all of the CRLFs at OTHER positions alone.

(Hence the reverse order: a 1500+ character line has something like 56 CRLFs in it, one at every position 51. The last ones have to be removed first in order to preserve string integrity.)

Due to the necessity of reverse order, as far as I can tell, this means sed is out. Regex find and replace in Notepad++ doesn't have "backward direction" as a selectable option either.

I'm on Windows. The file itself was generated via PowerShell, but I have Python, Node, and Cygwin (via Cmder) installed, and honestly would be willing to install just about anything for this. WSL is currently out of the question due to a corporate policy, and so is VBScript.

I tried various extended find-and-replace options in Notepad++, but there are no consistent delimiters aside from [CR][LF] at position 51.

Example -- with an attempt to preserve formatting:

COMMENT ON COLUMN "vendor"."things_andstuf_associa
tions"."id" IS 'The unique identifier for a things
andstuf association record.';

COMMENT ON COLUMN "vendor"."things_andstuf_associa
tions"."course_id" IS 'Identifies the course.';
COMMENT ON COLUMN "vendor"."things_andstuf_associa
tions"."created_at" IS 'Timestamp of when the reco
rd was created.';

COMMENT ON COLUMN is just one small section of thousands of lines of logging. Some lines start with debug, some with info, some with SELECT, some with formatted dates, some with UPSERT, some with ON CONFLICT... it varies widely. Not all lines end with a semicolon. Keeping the blank lines would probably be preferable.

There are no unique strings of formatted text specific to the start of each line. I may have to (and am willing to) accept that all lines that are complete at 50 characters exactly will be incorrectly merged along with all of the lines that are being correctly merged.

The output is coming from a compiled Python application and is being captured via Start-Transcript in PowerShell. I cannot affect the output within PowerShell at the time it is being generated. I can, however, modify the transcript file that PowerShell writes after the fact.

The only constant I am able to find is that wrapped lines have a CRLF at position 51.


Solution

  • OP has stated there are no common starts/ends for the true lines. Modifying OP's sample to show some variation:

    $ cat file.txt
    COMMENT ON COLUMN "vendor"."things_andstuf_associa
    tions"."id" IS 'The unique identifier for a things
    andstuf association record.';
    
    some other start  "vendor"."things_andstuf_associa
    tions"."course_id" IS 'Identifies the course.'
    yet another start|"vendor"."things_andstuf_associa
    tions"."created_at" IS 'Timestamp of when the reco
    rd was created.'
    

    OP has stated lines end with CR/LF (\r\n), so I ensured my file uses CR/LF line endings:

    $ unix2dos file.txt
    $ file file.txt
    file.txt: ASCII text, with CRLF line terminators
    

    One awk idea:

    awk -v maxlen=51 '                                    # maxlen = 50 characters + the trailing CR (\r) that awk keeps on each CRLF line
    length() <  maxlen { print line $0; line = "" }       # if length < 51 then print current value of "line" variable plus current line ($0); reset/clear "line" variable
    length() == maxlen { sub(/\r$/,""); line = line $0 }  # if length = 51 then strip the CR (\r) character and append to "line" variable
    END                { if (line != "") print line }     # at end of file print "line" if not empty
    ' file.txt > newfile.txt
    
    #### one-liner sans comments:
    
    awk -v maxlen=51 'length()<maxlen {print line $0; line=""} length()==maxlen {sub(/\r$/,""); line=line $0} END {if (line!="") print line}' file.txt > newfile.txt
    

    NOTES:

    • if the max length will always be 51 then -v maxlen=51 could be removed and all other references to maxlen would then be replaced with 51
    • if a true/complete line is exactly 50 characters (+ \r) in length this will erroneously merge said line with the next line from the file
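
Since the asker also has Python installed, the same idea can be sketched there. This is a minimal sketch, not part of the original answer: the function name `unwrap` and its `width` parameter are made up for illustration, and it assumes the whole file fits in memory and uses CRLF endings throughout. It also shows that a single top-to-bottom pass is enough, because each physical line is measured on its own, so removing one CRLF never shifts the position of a later one.

```python
def unwrap(text: str, width: int = 50) -> str:
    """Join each physical line of exactly `width` characters with the next one.

    Assumes CRLF line endings; `width` excludes the line terminator.
    """
    had_final_crlf = text.endswith("\r\n")
    lines = text.split("\r\n")
    if had_final_crlf:
        lines.pop()  # split() leaves an empty string after the final CRLF

    out = []
    pending = ""  # accumulated pieces of a wrapped line
    for line in lines:
        if len(line) == width:
            pending += line             # full-width piece: its CRLF is dropped
        else:
            out.append(pending + line)  # shorter piece ends the logical line
            pending = ""
    if pending:                         # input ended on a full-width piece
        out.append(pending)

    result = "\r\n".join(out)
    return result + "\r\n" if had_final_crlf else result


# A 120-character line wrapped at 50 characters is restored in one forward pass.
wrapped = "A" * 50 + "\r\n" + "A" * 50 + "\r\n" + "A" * 20 + "\r\n"
restored = unwrap(wrapped)

# Caveat from the notes: a genuine 50-character line is merged with its successor.
genuine = "B" * 50 + "\r\n" + "next line\r\n"
merged = unwrap(genuine)
```

To apply it to a file, open the input and output with `newline=""` (for example `open("file.txt", newline="")`) so that Python leaves the CRLFs untranslated. The right `encoding` for the transcript is not stated in the question and would need to be checked.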

    This generates:

    $ cat newfile.txt
    COMMENT ON COLUMN "vendor"."things_andstuf_associations"."id" IS 'The unique identifier for a thingsandstuf association record.';
    
    some other start  "vendor"."things_andstuf_associations"."course_id" IS 'Identifies the course.'
    yet another start|"vendor"."things_andstuf_associations"."created_at" IS 'Timestamp of when the record was created.'
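
A quick sanity check on the result can be sketched in Python as well. This is an illustrative helper, not part of the original answer: `check_unwrap` is a hypothetical name, and it assumes both the wrapped input and the unwrapped output use CRLF endings. It confirms that only line breaks differ between the two texts and counts how many CRLFs were removed. It cannot detect a genuine 50-character line that was merged by mistake.

```python
def check_unwrap(original: str, result: str) -> dict:
    """Compare wrapped input and unwrapped output (both with CRLF endings)."""
    return {
        # Apart from line breaks, the two texts should be identical.
        "same_text": original.replace("\r\n", "") == result.replace("\r\n", ""),
        # Number of CRLFs that were removed by the unwrapping step.
        "joins": original.count("\r\n") - result.count("\r\n"),
    }


# Synthetic data: one 120-character line wrapped at 50, a blank line, a short line.
original = ("A" * 50 + "\r\n" + "A" * 50 + "\r\n" + "A" * 20 + "\r\n"
            + "\r\n" + "short\r\n")
result = "A" * 120 + "\r\n" + "\r\n" + "short\r\n"

report = check_unwrap(original, result)
print(report)
```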