Tags: bash, csv, unix, cut

Splitting a huge CSV file


I have a huge CSV file, about 20 GB, with 5,000 columns and 2,500,000 rows. I want to write each column into its own file. I already tried a for loop, but it is pretty slow. My code is below:

# Count the columns in the header line.
Columns=$(head -n 1 train.csv | sed "s/,/\n/g" | wc -l)
mkdir cols
for i in $(seq 1 "$Columns")
do
    echo "$i"
    # Re-reads the entire file for every column.
    tail -n +2 train.csv | cut -d',' -f"$i" > "cols/col_$i.txt"
done

I would appreciate any suggestions to speed this up.


Solution

  • Perl solution. It opens 1,000 files at once, so it makes 5 passes over your input. Run it with the input filename as a parameter.

    #!/usr/bin/perl
    use warnings;
    use strict;
    
    my $inputfile = shift;
    open my $input, '<', $inputfile or die $!;
    
    mkdir 'cols';
    
    my @headers = split /,/, <$input>;
    chomp $headers[-1];
    my $pos = tell $input;  # Remember where the first data line starts.
    
    my $step = 1000;
    for (my $from = 0; $from <= $#headers; $from += $step) {
        my $to = $from + $step - 1;
        $to = $#headers if $#headers < $to;
        warn "$from .. $to";
    
        # Open the files and print the headers in range.    
        my @fhs;
        for ($from .. $to) {
            open $fhs[ $_ - $from ], '>', "cols/col-$_" or die $!;
            print { $fhs[ $_ - $from ] } $headers[$_], "\n";
        }
    
        # Print the columns in range.
        while (<$input>) {
            chomp;
            my $i = 0;
            print { $fhs[$i++] } $_, "\n" for (split /,/)[ $from .. $to ];
        }
        close for @fhs;
        seek $input, $pos, 0;  # Go back to the first data line.
    }
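
  • Usage sketch, assuming the script above is saved under the hypothetical name split_cols.pl: the batch size of 1,000 presumably keeps the number of simultaneously open file handles under the typical per-process limit (often 1,024; see ulimit -n), so the 5,000 columns are written in 5 passes over the 20 GB file rather than the 5,000 passes made by the original loop.

    # Hypothetical filename; run with the input CSV as the only argument.
    perl split_cols.pl train.csv

    # Each output file holds the column header plus one value per data row,
    # i.e. 2,500,001 lines for this input.
    head -n 3 cols/col-0
    wc -l cols/col-4999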