I have a massive text/CSV file which is 6 GB big. When it was created, an error occurred and some newline characters (CRLF) were not removed from fields, so certain lines are broken.
Here's a simplified version:
Field1<TAB>Field2<TAB>Field3<TAB>Field4
Field1<TAB>Field2<TAB>Field3<TAB>Field4
Field1<TAB>Field2<TAB>Field3
<TAB>Field4
Field1<TAB>Field2<TAB>Field3<TAB>Field4
So field 3 on line 3 contains a stray CRLF, and the record is therefore broken across two lines.
I don't want to recreate the CSV file, which would take too long, but there must be a way to fix this, perhaps with a regular expression and the right tool.
It's easy to identify broken lines: they are less than 50 characters long, while all good lines are longer than 50 characters.
So I need a step which:

* identifies short lines
* removes the CRLF in front of that line
* does this for the whole file
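As a quick sanity check before fixing anything, the short lines can be counted first. A rough sketch in Python ("big_file.csv" is just a placeholder for the real file); since each broken line splits into two short pieces, as in the example above, the count should come out even:

    # Count lines of at most 50 characters; every broken record
    # contributes two of them, so the total should be even.
    short = 0
    with open("big_file.csv", "r", newline="") as f:
        for line in f:
            if len(line.rstrip("\r\n")) <= 50:
                short += 1
    print(f"{short} short lines, i.e. about {short // 2} broken records")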
I can create a macro in UltraEdit and search with the Perl regex
^.{0,50}$
and replace the CRLF in front of each match. That works, but it takes way too long; UltraEdit macros are handy but very slow.
Is there another way? Can I use a regex with some other tool to do the search and replace?
Thanks, Wolfgang
You can search for:
^(.{1,50})\n(.{1,50}\n)
and replace with:
$1$2
This matches a short line (the first half of a broken record), the line break after it, and the following short line, and replacing with $1$2 rejoins the two halves by dropping only the break between them. Depending on the tool, you may need \r\n instead of \n to match CRLF line endings.
Demo (using broken lines of 30 characters or less instead): https://regex101.com/r/pr5JhW/1
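For a 6 GB file, an editor-based replace can still be slow. The same join can be done in a single streaming pass that never holds more than one line in memory. Here is a minimal sketch in Python rather than an UltraEdit macro; the file names are placeholders, and it assumes, as stated in the question, that every line of 50 characters or less is half of a broken record:

    MAX_BROKEN_LEN = 50  # threshold from the question: broken halves are shorter

    def join_broken_lines(src_path, dst_path):
        """Rejoin records that were split across two short physical lines."""
        # newline="" keeps the original CRLF endings untouched on read and write
        with open(src_path, "r", newline="") as src, \
             open(dst_path, "w", newline="") as dst:
            pending = None  # first half of a broken record, if one is buffered
            for line in src:
                content = line.rstrip("\r\n")
                if pending is not None:
                    # Continuation half: writing both halves as one line
                    # drops the CRLF that split the record.
                    dst.write(pending + line)
                    pending = None
                elif len(content) <= MAX_BROKEN_LEN:
                    # Short line: first half of a broken record, hold it back.
                    pending = content
                else:
                    dst.write(line)  # good line, copy through unchanged
            if pending is not None:
                dst.write(pending + "\r\n")  # guard: file ended on a lone short line

    join_broken_lines("big_file.csv", "big_file_fixed.csv")

Because it reads and writes line by line, this is a single pass over the file, and the original is left untouched in case the 50-character rule turns out to have exceptions.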