Tags: c++, read-write

Split large files


I am developing a distributed system in which a server distributes a huge task to clients, who process it and return the result.
The server has to accept huge files, on the order of 20 GB.

The server has to split this file into smaller pieces and send each client the path to its piece; the clients in turn scp the pieces and process them.

I am using read and write to split the file, and it is ridiculously slow.

Code

#define _LARGEFILE64_SOURCE   /* for pread64 on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

//fildes     - source file handle
//offset     - the point from which the split is to be made
//buffersize - how much to split

//This function is called in a for loop

void chunkFile(int fildes, char* filePath, int client_id, unsigned long long* offset, int buffersize)
{
    unsigned char* buffer = (unsigned char*) malloc( buffersize );
    char* clientFileName = (char*) malloc( 1024 );
    /* prepare client file name */
    snprintf( clientFileName, 1024, "%s%d.txt", filePath, client_id );

    ssize_t readcount = 0;
    if( (readcount = pread64( fildes, buffer, buffersize, *offset )) < 0 )
    {
        /* error reading file */
        printf( "error reading file\n" );
    }
    else
    {
        *offset = *offset + readcount;
        //printf( "Read %zd bytes\n And offset becomes %llu\n", readcount, *offset );
        int clnfildes = open( clientFileName, O_CREAT | O_TRUNC | O_WRONLY, 0777 );

        if( clnfildes < 0 )
        {
            /* error opening client file */
        }
        else
        {
            if( write( clnfildes, buffer, readcount ) != readcount )
            {
                /* error writing client file */
            }
            /* close even after a failed write so the descriptor is not leaked */
            close( clnfildes );
        }
    }

    free( clientFileName );
    free( buffer );
}
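
The loop that drives chunkFile looks roughly like this (a sketch; num_clients, buffersize, and the destination path prefix are assumptions for illustration):

    int fildes = open( "bigfile", O_RDONLY );   // ~20 GB source file
    unsigned long long offset = 0;              // shared cursor into the source
    for( int client_id = 0; client_id < num_clients; client_id++ )
        chunkFile( fildes, "/tmp/chunk", client_id, &offset, buffersize );
    close( fildes );
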
  1. Is there a faster way to split the file?
  2. Is there any way a client can access its chunk of the file without using scp (read without transfer)?

I am using C++. I am willing to use other languages if they perform faster.


Solution

  • You can place the file within reach of a web server and then use curl from the clients:

    curl --range 10000-20000 http://the.server.ip/file.dat > result
    

    would fetch bytes 10000 through 20000 of file.dat (HTTP ranges are inclusive at both ends, so 10001 bytes).
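
    If the clients are C++ programs rather than shell scripts, the same range request can be made with libcurl; here is a minimal sketch (the URL and byte range mirror the command above, and the web server must support HTTP range requests):

    #include <curl/curl.h>
    #include <stdio.h>

    int main()
    {
        CURL* curl = curl_easy_init();              // one easy handle per transfer
        if( !curl ) return 1;

        FILE* out = fopen( "result", "wb" );        // destination for the chunk
        curl_easy_setopt( curl, CURLOPT_URL, "http://the.server.ip/file.dat" );
        curl_easy_setopt( curl, CURLOPT_RANGE, "10000-20000" );   // inclusive byte range
        curl_easy_setopt( curl, CURLOPT_WRITEDATA, out );         // default callback fwrites here
        CURLcode res = curl_easy_perform( curl );   // fetch just that slice

        fclose( out );
        curl_easy_cleanup( curl );
        return res == CURLE_OK ? 0 : 1;
    }

    (Build with -lcurl.)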

    If the file is highly redundant and the network is slow, compressing it on the fly will probably speed up the transfer a lot. For example, executing

    nc -l -p 12345 | gunzip > chunk
    

    on the client and then executing

    dd skip=10000 count=10000 if=bigfile bs=1 | gzip | nc client.ip.address 12345
    

    on the server, you can transfer a section of the file, gzip-compressed on the fly, without creating intermediate files.
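
    Note that bs=1 makes dd copy one byte at a time, which is slow in its own right for large sections. With GNU coreutils dd (an assumption about the server's toolchain) you can keep byte-precise skip and count while reading in larger blocks:

    dd if=bigfile bs=64K iflag=skip_bytes,count_bytes skip=10000 count=10000 | gzip | nc client.ip.address 12345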

    EDIT

    A single command to get a section of a file from a server using compression over the network is

    ssh server 'dd skip=10000 count=10000 bs=1 if=bigfile | gzip' | gunzip > chunk