Search code examples
perlcsvline-breaks

Remove mysterious line breaks in CSV file using Perl


I have a CSV file that I'm parsing using Perl. The file is a BOM produced by Solidworks 2015 that was saved as an XLS file, then opened in Excel and saved as a CSV file.

There are cells that have line breaks. When I read a line with such a cell from the file, the line comes in with the line breaks. For example, here is one of the lines read looks like this:

74,,74,1,1,"SJ-TL303202-DET-074-
001",PDSI,"2.25"" DIA. X  8.00""",A2,513,1,

It reads in as a single line in Perl.

When I turn the Show All Characters in Notepad++, I can see the line breaks are cause by [CR][LF].

So I thought this would work to remove the line feeds:

$line =~ s/[\r\n]+//g;

but it does not.


Solution

  • You don't give much of a sample of your CSV data, but what you show is perfectly valid. A text field may contain newlines if you wish, as long as it is enclosed in double-quotes

    The Text::CSV module will process it quite happily as long as you enable the binary option in the constructor call, and you may reformat the data as you wish before you write it back out again

    This program expects the path to the input file as a parameter on the command line, and it will write the modified data to STDOUT, which you can redirect on the command line, like this

    $ perl fix_csv.pl input.csv > output.csv
    

    I've assumed that your data contains only 7-bit ASCII data, and it should work whether you're running it on a Windows system or on Linux

    use strict;
    use warnings 'all';
    
    my ($csv_file) = @ARGV;
    
    use Text::CSV;
    
    open my $fh, '<', $csv_file or die qq{Unable to open "$csv_file" for input: $!};
    
    my $csv = Text::CSV->new( { binary => 1 } );
    
    while ( my $row = $csv->getline( $fh ) ) {
    
        tr/\r\n//d for @$row;
    
        $csv->combine(@$row);
        print $csv->string, "\n";
    }
    

    output

    74,,74,1,1,SJ-TL303202-DET-074-001,PDSI,"2.25"" DIA. X  8.00""",A2,513,1,