I am developing a distributed system in which a server distributes a huge task to clients, who process it and return the result.
The server has to accept huge files, of the order of 20 GB in size.
The server has to split this file into smaller pieces and send each piece's path to the clients, who in turn scp the file and process it.
I am using read and write to perform the file splitting, and it is ridiculously slow.
Code
#define _LARGEFILE64_SOURCE   /* for pread64 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

//fildes     - Source file handle
//filePath   - Base path used to build the client file name
//client_id  - Client number, used as the chunk file suffix
//offset     - The point from which the split is to be made (advanced on return)
//buffersize - How much to split
//This function is called in a for loop
void chunkFile(int fildes, char* filePath, int client_id, unsigned long long* offset, int buffersize)
{
    unsigned char* buffer = (unsigned char*) malloc( buffersize * sizeof(unsigned char) );
    char* clientFileName = (char*) malloc( 1024 );

    /* prepare client file name */
    snprintf( clientFileName, 1024, "%s%d.txt", filePath, client_id );

    ssize_t readcount = 0;
    if( (readcount = pread64( fildes, buffer, buffersize, *offset )) < 0 )
    {
        /* error reading file */
        printf( "error reading file\n" );
    }
    else
    {
        *offset = *offset + readcount;
        //printf("Read %zd bytes\nAnd offset becomes %llu\n", readcount, *offset);

        int clnfildes = open( clientFileName, O_CREAT | O_TRUNC | O_WRONLY, 0777 );
        if( clnfildes < 0 )
        {
            /* error opening client file */
        }
        else
        {
            if( write( clnfildes, buffer, readcount ) != readcount )
            {
                /* error writing client file */
            }
            close( clnfildes );   /* close the chunk file whether or not the write succeeded */
        }
    }

    free( clientFileName );
    free( buffer );
    return;
}
I am using C++. I am ready to use other languages if they can perform faster.
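For reference, a minimal sketch of the same split done block by block, so a chunk of several gigabytes never needs a single buffersize-sized allocation. The 1 MiB block size and the copyChunk name are arbitrary choices, not from the code above:

#define _LARGEFILE64_SOURCE   /* for pread64 */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCKSIZE (1024 * 1024)   /* arbitrary example block size: 1 MiB */

/* Copy chunkSize bytes starting at offset into chunkPath, one block at a time. */
int copyChunk(int fildes, const char* chunkPath,
              unsigned long long offset, unsigned long long chunkSize)
{
    int out = open(chunkPath, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (out < 0) return -1;

    unsigned char* block = malloc(BLOCKSIZE);
    if (!block) { close(out); return -1; }

    unsigned long long copied = 0;
    while (copied < chunkSize)
    {
        size_t want = (chunkSize - copied < BLOCKSIZE)
                      ? (size_t)(chunkSize - copied) : BLOCKSIZE;
        ssize_t got = pread64(fildes, block, want, offset + copied);
        if (got <= 0) break;                       /* read error or end of file */
        if (write(out, block, got) != got) break;  /* short or failed write */
        copied += got;
    }

    free(block);
    close(out);
    return (copied == chunkSize) ? 0 : -1;
}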
You can place the file within reach of a web server and then use curl from the clients:
curl --range 10000-20000 http://the.server.ip/file.dat > result
would get the bytes from offset 10000 to 20000 (inclusive).
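Since the clients are programs rather than people typing commands, the same range request can also be issued from C with libcurl. A minimal sketch, where the URL, range string and output path are placeholder values:

#include <stdio.h>
#include <curl/curl.h>

/* Fetch one byte range of the big file over HTTP and write it to outPath. */
int fetchRange(const char* url, const char* range, const char* outPath)
{
    FILE* out = fopen(outPath, "wb");
    if (!out) return 1;

    CURL* curl = curl_easy_init();
    if (!curl) { fclose(out); return 1; }

    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_RANGE, range);    /* e.g. "10000-20000" */
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);  /* default callback fwrites to this FILE* */

    CURLcode res = curl_easy_perform(curl);

    curl_easy_cleanup(curl);
    fclose(out);
    return (res == CURLE_OK) ? 0 : 1;
}

/* usage: fetchRange("http://the.server.ip/file.dat", "10000-20000", "result"); */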
If the file is highly redundant and the network is slow, compression could probably speed up the transfer a lot. For example, executing
nc -l -p 12345 | gunzip > chunk
on the client and then executing
dd skip=10000 count=10000 if=bigfile bs=1 | gzip | nc client.ip.address 12345
on the server, you can transfer a section with gzip compression done on the fly, without the need to create intermediate files.
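On the server side this pipeline can be driven from the existing chunking loop. A minimal sketch, assuming each client is already listening with the nc -l -p 12345 | gunzip > chunk command above (the sendChunk name and the port are examples):

#include <stdio.h>
#include <stdlib.h>

/* Compress and send one chunk of bigfile to a listening client. */
int sendChunk(const char* bigfile, const char* client_ip,
              unsigned long long offset, unsigned long long count)
{
    char cmd[1024];
    snprintf(cmd, sizeof(cmd),
             "dd skip=%llu count=%llu bs=1 if=%s | gzip | nc %s 12345",
             offset, count, bigfile, client_ip);
    return system(cmd);   /* status reported by system() for the pipeline */
}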
A single command to get a section of a file from a server using compression over the network is
ssh server 'dd skip=10000 count=10000 bs=1 if=bigfile | gzip' | gunzip > chunk