regex, csv, text-files, ultraedit

Remove CR from text file if subsequent line is shorter than X characters


I have a massive text/CSV file that is 6 GB in size. When it was created, an error occurred and some newline characters (CRLF) were not removed from fields, so certain lines are broken.

Here is a simplified example:

Field1<TAB>Field2<TAB>Field3<TAB>Field4
Field1<TAB>Field2<TAB>Field3<TAB>Field4
Field1<TAB>Field2<TAB>Field3
<TAB>Field4
Field1<TAB>Field2<TAB>Field3<TAB>Field4

So field 3 in line 3 contains a line break (CRLF), and therefore the line is broken.

I don't want to recreate the CSV file, which would take too long. There must be a way to fix this, perhaps with regular expressions and a suitable tool.

It's easy to identify broken lines: they are less than 50 characters long, whereas all good lines are longer than 50 characters.

So I need a step which:

* identifies short lines
* removes the CRLF in front of that line
* does this for the whole file

I can create a macro in UltraEdit that searches for the Perl regex

^.{0,50}$ 

and removes the CRLF in front of each match. That works, but it takes far too long: macros in UltraEdit are handy but very slow.

Is there another way? Can I use a regex with some other tool to search and replace?

Thanks, Wolfgang


Solution

  • You can search for:

    ^(.{1,50})\n(.{1,50}\n)
    

    and replace with:

    $1$2
    

    Demo (with the threshold set to 30 characters instead of 50): https://regex101.com/r/pr5JhW/1
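
For a 6 GB file, a line-by-line pass may be easier on memory than running the regex over the whole text. The following is only a minimal Python sketch of the same pairing rule (two consecutive lines of 1 to 50 characters are merged into one), not a drop-in equivalent of the regex. The names `merge_short_pairs`, `fix_file`, `max_len`, and the UTF-8 default are assumptions made for illustration; line endings are passed through unchanged, so a CRLF file stays CRLF.

```python
from typing import Iterable, Iterator


def merge_short_pairs(lines: Iterable[str], max_len: int = 50) -> Iterator[str]:
    """Merge each pair of consecutive short lines into a single line.

    A line counts as short when it has 1..max_len characters, not counting
    its line ending. Lines are expected to still carry their line endings.
    """
    pending = None  # (body, ending) of a short line waiting for its successor
    for line in lines:
        body = line.rstrip("\r\n")
        ending = line[len(body):]
        is_short = 1 <= len(body) <= max_len
        if pending is not None:
            if is_short:
                # Two short lines in a row: drop the break between them.
                yield pending[0] + body + ending
                pending = None
                continue
            # The successor is a normal line: emit the held line unchanged.
            yield pending[0] + pending[1]
            pending = None
        if is_short:
            pending = (body, ending)
        else:
            yield line
    if pending is not None:
        # A short line at the very end of the input has no partner.
        yield pending[0] + pending[1]


def fix_file(src_path: str, dst_path: str, max_len: int = 50,
             encoding: str = "utf-8") -> None:
    """Stream src_path to dst_path, merging pairs of short lines."""
    # newline="" disables newline translation, so CRLF is preserved as-is.
    with open(src_path, "r", encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for fixed in merge_short_pairs(src, max_len):
            dst.write(fixed)
```

It could be called as `fix_file("broken.csv", "fixed.csv", max_len=50)`, where both file names are placeholders. Writing to a second file rather than editing in place keeps the original available in case the length threshold turns out to merge lines that were not actually broken.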