
Handling `CRLF` line endings in substitution and `split` in Perl


I need to process many files (with CRLF line endings) that look like this:

$ cat -v file1.txt
1$XXX$ZZZ$$$$$$$$^M
2$AAA$BBB$$$$$$$$^M

$ cat -v file2.txt
1$4668$$$^M
2$46$$$^M

I need to:

  • remove the last $ sign,
  • change all remaining $ to commas,
  • enclose each field in double quotes,
  • rename the files.

Desired output (no matter if line endings are CRLF or LF):

$ cat newname1.csv
"1","XXX","ZZZ","","","","","","",""
"2","AAA","BBB","","","","","","",""

$ cat newname2.csv
"1","4668","",""
"2","46","",""

Here is my attempt:

#!/usr/bin/perl

use strict;
use warnings;

my %inputs = qw(
  file1 file1.txt
  file2 file2.txt
);

my %outputs = qw(
  file1 newname1.csv
  file2 newname2.csv
);

for my $key (keys %inputs) {
  
  open my $in, '<', $inputs{$key} or die $!;
  open my $out, '>', $outputs{$key} or die $!;
  
  while(<$in>) {
    local $, = ',';
    local $\ = "\n";
    s/\$$//;
    my @row = split /\$/;
    print $out map qq("$_"), @row;
  }
  
  close $in or die $!;
  close $out or die $!;
  
}

On Linux, it gives files with a CRLF enclosed in the last column and LF line endings:

$ cat -v newname1.csv
"1","XXX","ZZZ","","","","","","","","^M
"
"2","AAA","BBB","","","","","","","","^M
"

$ cat -v newname2.csv
"1","4668","","","^M
"
"2","46","","","^M
"

I guess the issue is due to CRLF line endings. Therefore, I tried:

  • changing '<' to '<:crlf' when opening the input files, which gives the same result;
  • using other regexes to match the last $ sign (e.g. \$\r\n and \$\R), both of which produce files without the empty trailing columns.

How can I fix my code to get my desired output?


Solution

  • Update: This answer was written for the first two versions of the question. I only undeleted it because the OP asked me to. It may not fit the current version of the question. Some things might be outright wrong.


    This has nothing to do with line endings being CRLF. It is just a split issue.

    If I add a Data::Dumper print to your code, right after you split into the variable @row:

    my @row = split /\$/;
    use Data::Dumper;
    print Dumper \@row;
    

    I get (for the first line of file2.txt):

    $VAR1 = [
              '1',
              '4668',
              '',
              '',
              '
    '
            ];
    

    Where you can see the trailing newline is the last field in your split.

    When you then treat these split results as genuine column values in your data, you get one extra field for the newline.

    I do not see where you are removing the last $. Maybe that is something you misunderstood?

    Suggested solution:

    If this is csv data, you should use a csv module to handle it. The Text::CSV module does this well. Here's a sample code that will handle your inputs:

    use strict;
    use warnings;
    use Text::CSV qw(csv);
    
    my %inputs = qw(
      file1 file1.txt
      file2 file2.txt
    );
    
    my %outputs = qw(
      file1 newname1.csv
      file2 newname2.csv
    );
    
    for my $key (keys %inputs) {
        my $aoa = csv (in => $inputs{$key}, sep_char => '$');
        csv (in => $aoa, out => $outputs{$key}, sep_char => ',', always_quote => 1);
    }
    

    Update:

    Since you edited your question and added a line of code that changes everything and makes your own claimed output "wrong", I've found the following:

    If you have only trailing empty fields, split will delete those empty fields by default. This can be fixed, as specified in the documentation for split:

    If LIMIT is negative, it is treated as if it were instead arbitrarily large; as many fields as possible are produced.

    If LIMIT is omitted (or, equivalently, zero), then it is usually treated as if it were instead negative but with the exception that trailing empty fields are stripped (empty leading fields are always preserved); if all fields are empty, then all fields are considered to be trailing (and are thus stripped in this case).

    In other words, you can change

    split /\$/;
    

    to

    split /\$/, $_, -1;
    

    to fix your missing trailing empty fields.
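As a small illustration of the LIMIT behaviour quoted above, the sketch below splits the same string with and without -1. The sample string and the variable names are made up for the demonstration and do not come from the question.

```perl
use strict;
use warnings;

# Made-up sample record: two values followed by three empty fields.
my $line = 'a$b$$$';

# Default LIMIT: trailing empty fields are stripped.
my @stripped = split /\$/, $line;

# Negative LIMIT: every field is kept, including trailing empty ones.
my @kept = split /\$/, $line, -1;

print scalar(@stripped), "\n";   # prints 2
print scalar(@kept), "\n";       # prints 5
```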

    The only problem is that you have not reported having this problem (yet). So, I guess we need to wait for you to update your question.
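For what it is worth, here is a minimal sketch that combines the points above with the requirements listed in the question. It is not taken from the original code: to_csv_line is a hypothetical helper name, and the sketch assumes that the line ending is what keeps s/\$$// from matching (the $ anchor only matches at the very end of the string or before a final "\n", so the "\r" of a CRLF gets in the way). It strips the line ending first, then removes the last $, then splits with a negative LIMIT. \R needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one '$'-separated record into a quoted CSV line.
sub to_csv_line {
    my ($line) = @_;
    $line =~ s/\R\z//;                  # drop a trailing CRLF or LF, if any
    $line =~ s/\$\z//;                  # remove the last '$'
    my @row = split /\$/, $line, -1;    # -1 keeps trailing empty fields
    return join ',', map { qq("$_") } @row;
}

# Sample record copied from file2.txt in the question, with a CRLF ending.
my $sample = '1$4668$$$' . "\r\n";
print to_csv_line($sample), "\n";       # prints "1","4668","",""
```

In the loop from the question, this would replace the body of the while block, for example print $out to_csv_line($_), "\n"; without setting $, and $\ locally.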