Porting character buffers into Rcpp


I am trying to run C code in R using Rcpp, but I am unsure how to port a buffer that holds data read from a file. In the third line of the code below I allocate an unsigned char buffer, and my problem is that I don't know which Rcpp data type to use for it. Once the data are in the buffer, I figured out how to use Rcpp::NumericMatrix to hold the final result, but not what to use for the character buffer itself. I have seen several responses by Dirk Eddelbuettel to similar questions where he suggests replacing all 'malloc' calls with Rcpp initialization commands. I tried an Rcpp::CharacterVector, but then there is a type mismatch in the loop at the end: the Rcpp::CharacterVector elements cannot be read as an unsigned long long int. The code runs with some C compilers but throws a 'memory corruption' error with others, so I would prefer to do things the way Dirk suggests (use Rcpp data types) so that the code runs regardless of the specific compiler.

    FILE *fp = fopen( filename, "r" );
    fseek( fp, index_data_offset, SEEK_SET );
    unsigned char* buf = (unsigned char *)malloc( 3 * number_of_index_entries * sizeof(unsigned long long int) );
    fread( buf, sizeof("unsigned long long int"), (long)(3 * number_of_index_entries), fp );
    fclose( fp );

    // Convert "buf" into a 3-column matrix.
    unsigned long long int l;
    Rcpp::NumericMatrix ToC(3, number_of_index_entries);
    for (int col=0; col<number_of_index_entries; col++ ) {
        l = 0;
        int offset = (col*3 + 0)*sizeof(unsigned long long int);
        for (int i = 0; i < 8; ++i) {
            l = l | ((unsigned long long int)buf[i+offset] << (8 * i));
        }
        ToC(0,col) = l;

        l = 0;
        offset = (col*3 + 1)*sizeof(unsigned long long int);
        for (int i = 0; i < 8; ++i) {
            l = l | ((unsigned long long int)buf[i+offset] << (8 * i));
        }
        ToC(1,col) = l;

        l = 0;
        offset = (col*3 + 2)*sizeof(unsigned long long int);
        for (int i = 0; i < 8; ++i) {
            l = l | ((unsigned long long int)buf[i+offset] << (8 * i));
        }
        ToC(2,col) = l;
    }
    return( ToC );

Solution

  • C and C++ can be lovely. If you know what you're doing, you have both a very direct line to the underlying hardware and higher-level abstraction for efficient reasoning.

    I would suggest simplifying and reducing the problem. Start with a simple, known case, for example an STL vector of double. Let's call it x. Fill it with ten or a hundred elements, then open a FILE and write a blob from

    x.data(),  x.size() * sizeof(double)
    

    Close the file. Then read it into Rcpp by first allocating a NumericVector v of the same size, then reading the bytes back and calling memcpy to &(v[0]).

    It should be the same vector.
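The round trip described above can be sketched roughly as follows. This is a minimal illustration in plain C++, not a definitive implementation: the function names `write_blob` and `read_blob` are hypothetical, and a `std::vector<double>` stands in for the destination so that the sketch compiles without R. Inside an Rcpp function the destination would be an `Rcpp::NumericVector v(n)` and the `memcpy` target would be `&(v[0])`, as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: write the raw bytes of x to `path`.
// Returns true only if every element was written.
bool write_blob(const std::string& path, const std::vector<double>& x) {
    std::FILE* fp = std::fopen(path.c_str(), "wb");  // binary mode
    if (fp == nullptr) return false;
    const std::size_t written =
        std::fwrite(x.data(), sizeof(double), x.size(), fp);
    std::fclose(fp);
    return written == x.size();
}

// Hypothetical helper: read n doubles back from `path`.
// The bytes land in an unsigned char buffer first and are then
// copied into the typed vector with memcpy.
// Returns an empty vector if the file cannot be read in full.
std::vector<double> read_blob(const std::string& path, std::size_t n) {
    std::vector<unsigned char> buf(n * sizeof(double));  // frees itself
    std::vector<double> v(n);  // with Rcpp: Rcpp::NumericVector v(n);
    assert(buf.size() == v.size() * sizeof(double));

    std::FILE* fp = std::fopen(path.c_str(), "rb");
    if (fp == nullptr) return {};
    const std::size_t got = std::fread(buf.data(), 1, buf.size(), fp);
    std::fclose(fp);
    if (got != buf.size()) return {};

    if (n > 0) {
        std::memcpy(v.data(), buf.data(), buf.size());  // with Rcpp: &(v[0])
    }
    return v;
}
```

Calling `write_blob("x.bin", x)` and then `read_blob("x.bin", x.size())` should give back a vector that compares equal to `x`, provided both calls run on the same machine.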

    Then you can generalize to different types. Because vectors are guaranteed to use contiguous memory, you can use this serialization trick directly.

    You can do variations on this with character buffers, or void*, or ... None of that matters as long as you are careful not to mismatch types, i.e. don't assign an int payload to a double and so on.
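For the character-buffer variation in the question, one possible sketch is below. It assumes the file stores 64-bit little-endian integers, which is what the shift loop in the question decodes. The names `decode_le64` and `decode_index` are hypothetical. A `std::vector<unsigned char>` replaces the `malloc`'d buffer, and a `std::vector<double>` in column-major order stands in for the 3-row `Rcpp::NumericMatrix`. One detail in the question that may be worth checking: `sizeof("unsigned long long int")` is the size of a string literal (23 bytes), not of the integer type, so that `fread` call can request more bytes than the buffer holds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode 8 little-endian bytes starting at p, using the same
// shift-and-or scheme as the loop in the question.
std::uint64_t decode_le64(const unsigned char* p) {
    std::uint64_t value = 0;
    for (int i = 0; i < 8; ++i) {
        value |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    }
    return value;
}

// Hypothetical helper: convert a byte buffer holding 3 * n_entries
// 64-bit values into doubles. Element k of the result corresponds to
// row k % 3, column k / 3 of a 3 x n_entries column-major matrix.
// Note: a double represents integers exactly only up to 2^53.
std::vector<double> decode_index(const std::vector<unsigned char>& buf,
                                 std::size_t n_entries) {
    assert(buf.size() >= 3 * n_entries * sizeof(std::uint64_t));
    std::vector<double> out(3 * n_entries);
    for (std::size_t k = 0; k < out.size(); ++k) {
        const unsigned char* p = buf.data() + k * sizeof(std::uint64_t);
        out[k] = static_cast<double>(decode_le64(p));
    }
    return out;
}
```

Because the buffer is a vector, it is released automatically when it goes out of scope, and its `size()` can be compared against the number of bytes `fread` actually returned before decoding.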

    Now, is any of this recommended? Hell no, unless you are chasing performance and know well enough what you are doing, in which case it is reasonable. Otherwise rely on fantastic existing packages like fst or qs to do it for you.

    I hope this helps with your question. I wasn't entirely sure what you were asking. If it doesn't, maybe you can clarify (and possibly shorten / focus) the question.