In Perl, how to match two consecutive Carriage Returns?


Hi StackOverflow buddies,

I'm on Windows, and I have a data file in which, for reasons I don't understand, every "Carriage Return + New Line" pair has become "Carriage Return + Carriage Return + New Line". (190128 edit:) For example:

When viewing the file as plain text, it is:

[Image: the text file viewed as plain text, with unprintable characters visible]

When viewing the same file in hex mode, it is:

[Image: the same file in hex mode, where the doubled "0D" bytes are visible]

For practical reasons I need to remove the extra "0D" wherever "0D" is doubled, so that ".... 30 30 0D 0D 0A 30 30 ...." becomes ".... 30 30 0D 0A 30 30 ....".

190129 edit: To make sure the problem can be reproduced, I uploaded my data file to GitHub (download and unzip it before use; in a binary/hex editor you can see 0D 0D 0A in the first line): https://github.com/katyusza/hello_world/blob/master/ram_init.zip

I used the following Perl script to remove the extra Carriage Return, but to my astonishment my regex just does NOT work! My entire code is (190129 edit: pasted the entire Perl script here):

use warnings            ;
use strict              ;
use File::Basename      ;

#-----------------------------------------------------------
# command line handling, file open \ create
#-----------------------------------------------------------

# Capture input filename from command line:
my $input_fn = $ARGV[0] or
die "Should provide input file name at command line!\n";

# Parse input file name, and generate output file name:
my ($iname, $ipath, $isuffix) = fileparse($input_fn, qr/\.[^.]*/);
my $output_fn = $iname."_pruneNonPrintable".$isuffix;

# Open input file:
open (my $FIN, "<", $input_fn) or die "Open file error $!\n";

# Create output file:
open (my $FO, ">", $output_fn) or die "Create file error $!\n";


#-----------------------------------------------------------
# Read input file, search & replace, write to output
#-----------------------------------------------------------

# Read all lines in one go:
$/ = undef;

# Read entire file into variable:
my $prune_txt = <$FIN> ;

# Do match & replace:
 $prune_txt =~ s/\x0D\x0D/\x0D/g;          # do NOT work.
# $prune_txt =~ s/\x0d\x0d/\x30/g;          # do NOT work.
# $prune_txt =~ s/\x30\x0d/\x0d/g;          # can work.
# $prune_txt =~ s/\x0d\x0d\x0a/\x0d\x0a/gs; # do NOT work.

# Write the result to the output file:
print $FO $prune_txt  ;

# Close files:
close($FIN)     ;
close($FO)      ;

I did everything I could to match two consecutive Carriage Returns, but failed. Can anyone please point out my mistake, or tell me the right way to go? Thanks in advance!


Solution

  • On Windows, file handles have a :crlf layer given to them by default.

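    • This layer converts CR LF to LF on read.
    • This layer converts LF to CR LF on write.

    So the CR CR LF in your file reaches your program as CR LF: there are never two adjacent CRs in the string for s/\x0D\x0D/\x0D/g to match, and the remaining CR LF is written back out as CR CR LF.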
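
    The effect of the layer on the data from the question can be checked with a short sketch. It is only an illustration: the temporary file and the variable names are made up for the example, and :crlf is requested explicitly so that the result should be the same on systems where the layer is not a default.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the raw bytes "00", CR, CR, LF, "00" to a temporary file.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode($tmp_fh);                        # no translation while writing
print {$tmp_fh} "00\x0D\x0D\x0A00";
close($tmp_fh) or die "Close error: $!\n";

# Read the file back through an explicit :crlf layer
# (on Windows this layer is already there by default).
open(my $in, '<:crlf', $tmp_name) or die "Open error: $!\n";
my $text = do { local $/; <$in> };       # slurp the whole file
close($in);

# Show what the program actually sees, as hex bytes.
my $seen = join ' ', map { sprintf '%02X', ord } split //, $text;
print "$seen\n";                         # prints: 30 30 0D 0A 30 30
```

    Only one 0D is left by the time the regex runs, which matches the "CR CR LF ⇒ CR LF" step on read.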

    Solution 1: Compensate for the :crlf layer.

    You'd use this solution if you want to end up with system-appropriate line endings.

    # ... read ...      # CR CR LF ⇒ CR LF
    s/\r+\n/\n/g;       # CR LF    ⇒ LF
    # ... write ...     # LF       ⇒ CR LF
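
    Filled in, Solution 1 might look like the sketch below. The helper names strip_extra_cr and fix_file are hypothetical, and the :crlf layers are written out explicitly (an assumption made for portability of the example; on Windows plain < and > already behave this way).

```perl
use strict;
use warnings;

# Hypothetical helper: replace any run of CRs before an LF with the LF alone.
sub strip_extra_cr {
    my ($text) = @_;
    $text =~ s/\r+\n/\n/g;
    return $text;
}

# Hypothetical helper: read $input_fn, fix it, write $output_fn.
sub fix_file {
    my ($input_fn, $output_fn) = @_;

    # :crlf is spelled out so the sketch behaves the same on every platform.
    open(my $fin, '<:crlf', $input_fn) or die "Open file error: $!\n";
    my $text = do { local $/; <$fin> };      # CR CR LF arrives as CR LF
    close($fin);

    open(my $fout, '>:crlf', $output_fn) or die "Create file error: $!\n";
    print {$fout} strip_extra_cr($text);     # LF leaves as CR LF
    close($fout) or die "Close file error: $!\n";
    return;
}

# Run as a script only when an input and an output file name are given.
fix_file(@ARGV[0, 1]) if @ARGV == 2;
```

    Using local $/ inside a do block keeps the slurp setting from leaking into the rest of the program, unlike a global $/ = undef.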
    

    Solution 2: Remove the :crlf layer.

    You'd use this solution if you want to end up with CR LF unconditionally.

    Use <:raw and >:raw instead of < and > as the mode.

    # ... read ...      # CR CR LF ⇒ CR CR LF
    s/\r*\n/\r\n/g;     # CR CR LF ⇒ CR LF
    # ... write ...     # CR LF    ⇒ CR LF
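
    A fuller sketch of Solution 2, under the same caveats: the helper names normalize_to_crlf and fix_file_raw are hypothetical, and error messages are only modelled on the ones in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: make every line end in exactly one CR LF.
# Because \r* also matches zero CRs, a bare LF becomes CR LF as well.
sub normalize_to_crlf {
    my ($bytes) = @_;
    $bytes =~ s/\r*\n/\r\n/g;
    return $bytes;
}

# Hypothetical helper: read $input_fn byte for byte, fix it, write $output_fn.
sub fix_file_raw {
    my ($input_fn, $output_fn) = @_;

    open(my $fin, '<:raw', $input_fn) or die "Open file error: $!\n";
    my $bytes = do { local $/; <$fin> };     # CR CR LF stays CR CR LF
    close($fin);

    open(my $fout, '>:raw', $output_fn) or die "Create file error: $!\n";
    print {$fout} normalize_to_crlf($bytes); # written exactly as given
    close($fout) or die "Close file error: $!\n";
    return;
}

# Run as a script only when an input and an output file name are given.
fix_file_raw(@ARGV[0, 1]) if @ARGV == 2;
```

    With :raw on both handles nothing is translated behind the script's back, so the bytes the regex produces are the bytes that end up in the file.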