I have a large text file that got word-wrapped at 50 characters by the application that outputs it.
The original unwrapped line length varies wildly, from 1 character up to 1500+ characters.
I need something that can process the file in reverse order (starting from the bottom) and remove every CRLF that sits at position 51, but leave all of the CRLFs at OTHER positions alone.
(Hence the reverse order. A 1500+ character line has like 56 CRLFs in it, one at every position 51. The last ones have to be removed first in order to preserve string integrity.)
Due to the necessity of reverse order, as far as I can tell, this means sed is out. Regex find and replace in Notepad++ doesn't have "backward direction" as a selectable option either.
I'm on Windows. The file itself was generated via PowerShell. I have Python, Node, and Cygwin (via Cmder) installed, and honestly would be willing to install just about anything for this, but WSL is currently out of the question due to a corporate policy. So is VBScript, for that matter.
I tried various extended find and replace options in Notepad++, but there are no consistent delimiters aside from [CR][LF] at position 51.
Example -- with an attempt to preserve formatting:
COMMENT ON COLUMN "vendor"."things_andstuf_associa
tions"."id" IS 'The unique identifier for a things
andstuf association record.';
COMMENT ON COLUMN "vendor"."things_andstuf_associa
tions"."course_id" IS 'Identifies the course.';
COMMENT ON COLUMN "vendor"."things_andstuf_associa
tions"."created_at" IS 'Timestamp of when the reco
rd was created.';
COMMENT ON COLUMN is just one small section of thousands of lines of logging. Some lines start with debug, some with info, some with SELECT, some with formatted dates, some with UPSERT, some with ON CONFLICT... it varies widely and wildly. Not all lines end with a semicolon. Maintaining blank lines would probably be preferable.
There are no unique strings of formatted text specific to the start of each line. I may have to (and am willing to) accept that all lines that are complete at 50 characters exactly will be incorrectly merged along with all of the lines that are being correctly merged.
The output comes from a compiled Python application and is captured via Start-Transcript in PowerShell. I cannot affect the output within PowerShell at the time it is being generated. I can, however, modify the transcript file that PowerShell writes after the fact.
The only constant I am able to find is that wrapped lines have a CRLF at position 51.
OP has stated there are no common starts/ends for the true lines. Modifying OP's sample to show some variation:
$ cat file.txt
COMMENT ON COLUMN "vendor"."things_andstuf_associa
tions"."id" IS 'The unique identifier for a things
andstuf association record.';
some other start "vendor"."things_andstuf_associa
tions"."course_id" IS 'Identifies the course.'
yet another start|"vendor"."things_andstuf_associa
tions"."created_at" IS 'Timestamp of when the reco
rd was created.'
OP has stated lines end with CR/LF (\r\n), so I ensured my file ends with CR/LF:
$ unix2dos file.txt
$ file file.txt
file.txt: ASCII text, with CRLF line terminators
One awk idea:
awk -v maxlen=51 ' # set awk variable "maxlen"
length() < maxlen { print line $0; line = "" } # if length < 51 then print current value of "line" variable plus current line ($0); reset/clear "line" variable
length() == maxlen { sub(/\r$/,""); line = line $0 } # if length = 51 then strip the CR (\r) character and append to "line" variable
END { if (line != "") print line } # at end of file print "line" if not empty
' file.txt > newfile.txt
#### one-liner sans comments:
awk -v maxlen=51 'length()<maxlen {print line $0; line=""} length()==maxlen {sub(/\r$/,""); line=line $0} END {if (line!="") print line}' file.txt > newfile.txt
NOTES:
-v maxlen=51 could be removed, and all other references to maxlen would then be replaced with 51.
If a true line is exactly 50 characters (51 including the trailing \r) in length, this will erroneously merge said line with the next line from the file.
This generates:
$ cat newfile.txt
COMMENT ON COLUMN "vendor"."things_andstuf_associations"."id" IS 'The unique identifier for a thingsandstuf association record.';
some other start "vendor"."things_andstuf_associations"."course_id" IS 'Identifies the course.'
yet another start|"vendor"."things_andstuf_associations"."created_at" IS 'Timestamp of when the record was created.'
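To see the caveat from the NOTES in action, here is a small sketch that runs the same one-liner on a made-up two-line sample. The sample strings and the variable name merged are assumptions for illustration only, not taken from OP's log. The first line is a complete line of exactly 50 characters, the second is an unrelated short line, and both end with CR/LF:

```shell
# Hypothetical sample: a complete 50-character line plus an unrelated short line,
# each terminated with CR/LF (printf reuses the format for every argument).
merged=$(printf '%s\r\n' \
    '12345678901234567890123456789012345678901234567890' \
    'next line' |
  awk -v maxlen=51 'length()<maxlen {print line $0; line=""} length()==maxlen {sub(/\r$/,""); line=line $0} END {if (line!="") print line}' |
  tr -d '\r')   # drop the remaining CR so the result is easy to inspect

printf '%s\n' "$merged"
```

The two input lines come out as a single line, since awk cannot tell a complete 50-character line from a wrapped one. OP has indicated this trade-off may be acceptable.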