linux bash gzip tar bzip2

Efficient transfer of console data with tar & gzip/bzip2, without creating intermediary files


Linux environment. We have a program, 't_show', which, when executed with an ID, writes the price data for that ID to the console. There is no other way to get this data.

I need to copy the price data for IDs 1-10,000 between two servers, using minimum bandwidth and the minimum number of connections. On the destination server the data should end up as a separate file for each ID, named:

<id>.dat

Something like this would be the long-winded solution:

source:

files=`seq 1 10000`
for id in $files;
do
    ./t_show $id > $id
done
tar cf - $files | nice gzip -c  > dat.tar.gz

dest:

scp user@source:dat.tar.gz ./
gunzip dat.tar.gz
tar xvf dat.tar

That is, write each output to its own file, tar & compress, send over the network, extract.

It has the problem that I need to create a new file for each id. This takes up tonnes of space and doesn't scale well.

Is it possible to write the console output directly to a (compressed) tar archive without creating the intermediate files? Any better ideas (maybe writing compressed data directly across network, skipping tar)?

The tar archive would need to extract on the destination server as a separate file for each ID, as described above.

Thanks to anyone who takes the time to help.


Solution

  • Thanks all

    I've taken the advice 'just send the data formatted in some way and parse it on the receiver', which seems to be the consensus. I'm skipping tar and using ssh -C for simplicity.
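
As a rough illustration of that advice, here is a minimal shell sketch of the idea: the sender prints a marker line before each ID's data, and the receiver splits the stream back into one file per ID. The function names emit_stream and split_stream are made up for this example, t_show is replaced by a stub so the sketch runs anywhere, and the <id>.dat naming follows the question rather than the script below.

```shell
#!/bin/sh
# Stub standing in for the real t_show (assumption: it prints lines of text).
t_show() { printf 'price data for %s\nsecond line for %s\n' "$1" "$1"; }

# Sender side: print a "HEADER <id>" marker line, then that ID's data.
emit_stream() {
    for id in "$@"; do
        echo "HEADER $id"
        t_show "$id"
    done
}

# Receiver side: read the stream on stdin and write <id>.dat files into
# the directory given as $1, starting a new file at every marker line.
split_stream() {
    awk -v dir="$1" '
        /^HEADER [0-9]+$/ {
            if (out != "") close(out)   # keep only one output file open
            out = dir "/" $2 ".dat"
            printf "" > out             # create the file even if no data follows
            next
        }
        out != "" { print > out }       # lines before the first marker are dropped
    '
}
```

With the real program, the sender loop would run on the source host inside a single `ssh -C` command and its output would be piped into `split_stream` on the destination. Note that a data line that happens to look exactly like `HEADER <number>` would be mistaken for a marker, so the marker text should be something t_show never prints.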

    The Perl script below breaks the IDs into groups of 1000; the IDs are the source_id keys of a hash table. Each group is sent over a single ssh connection, with each ID's data preceded by a 'HEADER' line so the receiver knows which file to write to. This is a lot more efficient:

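    sub copy_tickserver_files {
        my $self = shift;

        my $prefix = 'cd tickserver/ ; ';
        my $cmd    = $prefix;
        my $i      = 0;

        # Only the source IDs are needed to build the remote command.
        while ( my ($source_id, $dest_id) = each ( %{ $self->{id_translations} } ) ) {
            # Print a marker line before each ID's data so the receiver can split the stream.
            $cmd .= qq{ echo HEADER $source_id ; ./t_show $source_id ; };
            $i++;
            # Run one ssh per group of 1000 IDs.
            if ( $i % 1000 == 0 ) {
                $cmd = qq{ssh -C dba\@$self->{source_env}->{tickserver} " $cmd " | };
                $self->copy_tickserver_files_subset( $cmd );
                $cmd = $prefix;
            }
        }

        # Send the final, partial group if there is one.
        if ( $cmd ne $prefix ) {
            $cmd = qq{ssh -C dba\@$self->{source_env}->{tickserver} " $cmd " | };
            $self->copy_tickserver_files_subset( $cmd );
        }
    }

    sub copy_tickserver_files_subset {
        my ( $self, $cmd ) = @_;

        my $out;    # handle for the file currently being written

        # $cmd ends in '|', so this open runs the command and reads its output.
        open( my $ticks, $cmd ) or die "Cannot run '$cmd': $!";
        while ( my $line = <$ticks> ) {
            # A marker line starts a new output file named after the ID.
            if ( $line =~ m{\A HEADER [ ] ([0-9]+) \s* \z}xms ) {
                my $id     = $1;
                my $output = "$self->{tmp_dir}/$id.ts";
                close $out if $out;
                open( $out, '>', $output ) or die "Cannot write $output: $!";
                next;
            }
            next unless $out;    # skip anything before the first marker
            print {$out} $line;
        }
        close $ticks;
        close $out if $out;
    }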
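
A possible variation on the sender side, sketched in shell and not part of the script above: instead of building one long remote command per group of IDs, the IDs could be fed to the remote host on stdin, so the remote command stays the same length however many IDs there are and one connection covers them all. The names REMOTE_LOOP and stream_ids are made up for this sketch; the tickserver/ directory and the HEADER marker are taken from the script above, and the ssh target is whatever host runs t_show.

```shell
#!/bin/sh
# Loop meant to run on the source host: reads one ID per line from stdin,
# prints a marker line, then that ID's data. t_show gets /dev/null as its
# stdin so it cannot consume the remaining IDs.
REMOTE_LOOP='while read -r id; do echo "HEADER $id"; ./t_show "$id" < /dev/null; done'

# Hypothetical wrapper: $1 is the ssh target (user@host). IDs arrive on
# stdin and the marker-delimited stream leaves on stdout, compressed in
# transit by ssh -C.
stream_ids() {
    ssh -C "$1" "cd tickserver/ && $REMOTE_LOOP"
}

# Example (not run here):
#   seq 1 10000 | stream_ids dba@source > stream.txt
```

The output has the same layout as before, so the receiving code that splits on HEADER lines would not need to change.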