
HDFql Filling Dataset Iteratively from std::vector


I am trying to fill a dataset in an HDF5 file iteratively using HDFql. By "iteratively" I mean that my simulator occasionally produces an update, and I then want to append more data (held in a std::vector) to my dataset. Strangely, something breaks after a few iterations and the dataset starts filling with zeros.

Luckily, the error also occurs in a minimal example and is reproducible with the code below:

#include <stdio.h>
#include <cstdint>   // uint16_t
#include <random>
#include <sstream>   // std::stringstream
#include <vector>
#include <HDFql.hpp>

int main (int argc, const char * argv[]) {
    HDFql::execute("CREATE TRUNCATE FILE /tmp/test_random.h5");
    HDFql::execute("USE FILE /tmp/test_random.h5");
    HDFql::execute("CREATE GROUP data");
    HDFql::execute("CREATE CHUNKED DATASET data/vals AS SMALLINT(UNLIMITED)");
    HDFql::execute("CLOSE FILE");
    std::stringstream ss;
    std::random_device rd;
    std::mt19937 eng(rd());
    std::uniform_int_distribution<> dist_vals(0, 500);
    std::uniform_int_distribution<> dist_len(300, 1000);
    for(int i=0; i<500; i++)
    {
        const int num_values = dist_len(eng);
        std::vector<uint16_t> vals;
        for(int j=0; j<num_values; j++)  // j avoids shadowing the outer loop's i
        {
            const int value = dist_vals(eng);
            vals.push_back(value);
        }
        HDFql::execute("USE FILE /tmp/test_random.h5");

        ss << "ALTER DIMENSION data/vals TO +" << vals.size();
        HDFql::execute(ss.str().c_str()); ss.str("");

        ss << "INSERT INTO data/vals(-" << vals.size() << ":1:1:" << vals.size() 
            << ") VALUES FROM MEMORY " 
            << HDFql::variableTransientRegister(vals.data());
        HDFql::execute(ss.str().c_str()); ss.str("");

        HDFql::execute("CLOSE FILE");
    }
}

This code runs for 500 iterations, appending a random number of random values to the dataset each time. In my latest run, everything beyond element 4065 in the final HDF5 file was just zeros.
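The two statements sent on each iteration can be examined in isolation. The sketch below only builds the statement strings and never calls HDFql. The helper names make_alter_statement and make_insert_statement are hypothetical, and variable_id stands in for whatever number HDFql::variableTransientRegister returns. As I read it, the selection -N:1:1:N is a hyperslab (start, stride, count, block) covering the last N elements; that reading is an assumption worth checking against the HDFql reference manual.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: the statement that grows `dataset` by `count` elements.
std::string make_alter_statement(const std::string& dataset, std::size_t count) {
    std::ostringstream ss;
    ss << "ALTER DIMENSION " << dataset << " TO +" << count;
    return ss.str();
}

// Hypothetical helper: the statement that writes `count` values into the tail
// of `dataset`. `variable_id` is a placeholder for the number returned by
// HDFql::variableTransientRegister in the real program.
std::string make_insert_statement(const std::string& dataset, std::size_t count,
                                  int variable_id) {
    assert(count > 0);  // an empty batch would produce a meaningless selection
    std::ostringstream ss;
    ss << "INSERT INTO " << dataset << "(-" << count << ":1:1:" << count
       << ") VALUES FROM MEMORY " << variable_id;
    return ss.str();
}
```

Building each statement in its own stream also removes the need to reset a shared std::stringstream with ss.str("") after every call.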

So my question is: what am I doing wrong here? Many thanks!

Edit

After further experimentation, I suspect this may be a bug in HDFql. Consider the following example:

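#include <stdio.h>
#include <cstdint>   // uint16_t
#include <iostream>  // std::cout
#include <random>
#include <sstream>   // std::stringstream
#include <vector>
#include <HDFql.hpp>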

int main (int argc, const char * argv[]) {
    HDFql::execute("CREATE TRUNCATE FILE /tmp/test_random.h5");
    HDFql::execute("USE FILE /tmp/test_random.h5");
    HDFql::execute("CREATE CHUNKED DATASET data/vals AS SMALLINT(0 TO UNLIMITED)");

    std::stringstream ss;
    std::random_device rd;
    std::mt19937 eng(rd());
    std::uniform_int_distribution<> dist_vals(0, 450);
    std::uniform_int_distribution<> dist_len(100, 300);
    int total_added = 0;

    for(int i=0; i<5000; i++)
    {
        const int num_values = 1024; //dist_len(eng);
        std::vector<uint16_t> vals;
        for(int j=0; j<num_values; j++)
        {
            const int value = dist_vals(eng);
            vals.push_back(value);
        }

        long long dim=0;
        ss << "SHOW DIMENSION data/vals INTO MEMORY " << HDFql::variableTransientRegister(&dim);
        HDFql::execute(ss.str().c_str()); ss.str("");

        ss << "ALTER DIMENSION data/vals TO +" << vals.size();
        HDFql::execute(ss.str().c_str()); ss.str("");

        ss << "INSERT INTO data/vals(-" << vals.size() << ":1:1:" << vals.size()
            << ") VALUES FROM MEMORY "
            << HDFql::variableTransientRegister(vals.data());
        HDFql::execute(ss.str().c_str()); ss.str("");

        total_added += vals.size();
        std::cout << i << ": "<<  ss.str() << ":  dim = " << dim
                << " : added = " << vals.size() << " (total="
                << total_added << ")" << std::endl;

    }

    HDFql::execute("CLOSE FILE");
}

This code keeps the number of values per iteration constant at 1024 (num_values = 1024;) and should work fine. However, if this is changed to 1025, the bug appears, as the console output shows:

....
235: :  dim = 240875 : added = 1025 (total=241900)
236: :  dim = 241900 : added = 1025 (total=242925)
237: :  dim = 0 : added = 1025 (total=243950)
238: :  dim = 0 : added = 1025 (total=244975)
239: :  dim = 0 : added = 1025 (total=246000)
....

This indicates that something breaks at iteration 237 in the output shown, since the dimension of the dataset is clearly not zero at that point.

Strangely, this does not explain why I was having the problem in the original example, since the number of values per iteration there was capped at 1000, below 1024.
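The failure in that log can be detected mechanically: before each append, the dimension reported by SHOW DIMENSION should equal the running total minus the batch that was just added. Here is a minimal sketch of that check, independent of HDFql. The struct LogRow and the function first_inconsistent_row are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <vector>

// One line of the console output: iteration, reported dimension,
// values added in this iteration, and running total after the append.
struct LogRow {
    int iteration;
    long long dim;
    long long added;
    long long total;
};

// Returns the iteration number of the first row whose reported dimension
// differs from (total - added), or -1 if every row is consistent.
int first_inconsistent_row(const std::vector<LogRow>& rows) {
    for (const LogRow& row : rows) {
        assert(row.added >= 0);  // a batch cannot have a negative size
        if (row.dim != row.total - row.added) {
            return row.iteration;
        }
    }
    return -1;
}
```

Fed the rows printed above, a check like this flags the first row where dim drops to 0, which makes it easy to stop the simulation at the first bad iteration rather than discovering the zeros afterwards.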


Solution

  • I figured out where the problem lies: of the two examples below, the first works and the second does not.

    Works

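    #include <stdio.h>
    #include <iostream>
    #include <random>
    #include <vector>
    #include <HDFql.hpp>
    
    int main (int argc, const char * argv[]) {
        char script[1024];  // holds each HDFql statement formatted by sprintf
        int total_added = 0;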
        std::random_device rd;
        std::mt19937 eng(rd());
        std::uniform_int_distribution<> dist_vals(0, 450);
        std::uniform_int_distribution<> dist_len(100, 300);
        const int fixed_buffer_size = 10000;
    
        HDFql::execute("CREATE TRUNCATE FILE /tmp/test_random.h5");
        HDFql::execute("USE FILE /tmp/test_random.h5");
        HDFql::execute("CREATE CHUNKED DATASET data/vals AS INT(0 TO UNLIMITED)");
    
        for(int i = 0; i < 5000; i++)
        {
            const int num_values = dist_len(eng);
            std::vector<int> vals(fixed_buffer_size);
            long long dim = 0;
            sprintf(script, "SHOW DIMENSION data/vals INTO MEMORY %d", HDFql::variableTransientRegister(&dim));
            HDFql::execute(script);
    
            sprintf(script, "ALTER DIMENSION data/vals TO +%d", num_values);
            HDFql::execute(script);
    
            for(int j=0; j<num_values; j++)
            {
                const int value = dist_vals(eng);
                vals.at(j) = value;
            }
            sprintf(script, "INSERT INTO data/vals(-%d:1:1:%d) VALUES FROM MEMORY %d", num_values, num_values, HDFql::variableTransientRegister(vals.data()));
            HDFql::execute(script);
            HDFql::execute("FLUSH");
    
            total_added += num_values;
            std::cout << i << ": " << ":  dim = " << dim << " : added = " << num_values << " (total=" << total_added << ")" << std::endl;
        }
    
        HDFql::execute("CLOSE FILE");
    }
    

    Fails

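    #include <stdio.h>
    #include <iostream>
    #include <random>
    #include <vector>
    #include <HDFql.hpp>
    
    int main (int argc, const char * argv[]) {
        char script[1024];  // holds each HDFql statement formatted by sprintf
        int total_added = 0;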
        std::random_device rd;
        std::mt19937 eng(rd());
        std::uniform_int_distribution<> dist_vals(0, 450);
        std::uniform_int_distribution<> dist_len(100, 300);
    
        HDFql::execute("CREATE TRUNCATE FILE /tmp/test_random.h5");
        HDFql::execute("USE FILE /tmp/test_random.h5");
        HDFql::execute("CREATE CHUNKED DATASET data/vals AS INT(0 TO UNLIMITED)");
    
        for(int i = 0; i < 5000; i++)
        {
            const int num_values = dist_len(eng);
            std::vector<int> vals(num_values);
            long long dim = 0;
            sprintf(script, "SHOW DIMENSION data/vals INTO MEMORY %d", HDFql::variableTransientRegister(&dim));
            HDFql::execute(script);
    
            sprintf(script, "ALTER DIMENSION data/vals TO +%d", num_values);
            HDFql::execute(script);
    
            for(int j=0; j<num_values; j++)
            {
                const int value = dist_vals(eng);
                vals.at(j) = value;
            }
            sprintf(script, "INSERT INTO data/vals(-%d:1:1:%d) VALUES FROM MEMORY %d", num_values, num_values, HDFql::variableTransientRegister(vals.data()));
            HDFql::execute(script);
            HDFql::execute("FLUSH");
    
            total_added += num_values;
            std::cout << i << ": " << ":  dim = " << dim << " : added = " << num_values << " (total=" << total_added << ")" << std::endl;
        }
    
        HDFql::execute("CLOSE FILE");
    }
    

    The only difference between the two is that the first uses a data buffer vals of fixed size (fixed_buffer_size), whereas the second creates a buffer of a new, random size on every iteration.

    I don't understand why this error occurs, since in C++ a std::vector is supposed to store its elements contiguously in memory and be fully compatible with C arrays and pointer arithmetic. But clearly something behaves differently between the two examples. Anyway, I hope this helps anyone else with this issue: the solution is to use fixed-size data buffers.
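The fixed-size workaround can be sketched without HDFql: allocate the buffer once, outside the loop, and fill only its first num_values slots on each iteration. The function name fill_batches and the seen_addresses bookkeeping are hypothetical, added only to show that the pointer which would be handed to HDFql::variableTransientRegister stays the same as long as the vector is never resized. This illustrates the pattern only; it does not explain why the per-iteration vector failed.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <vector>

// Fills one fixed-capacity buffer `iterations` times with a random number of
// values and returns how many distinct addresses vals.data() took on.
std::size_t fill_batches(int iterations, std::size_t fixed_buffer_size) {
    std::mt19937 eng(42);  // fixed seed so the sketch is repeatable
    std::uniform_int_distribution<> dist_vals(0, 450);
    std::uniform_int_distribution<> dist_len(100, 300);

    std::vector<int> vals(fixed_buffer_size);  // allocated once, never resized
    std::set<const int*> seen_addresses;

    for (int i = 0; i < iterations; ++i) {
        const std::size_t num_values = static_cast<std::size_t>(dist_len(eng));
        assert(num_values <= vals.size());  // the batch must fit in the buffer
        for (std::size_t j = 0; j < num_values; ++j) {
            vals[j] = dist_vals(eng);
        }
        // In the real program, vals.data() would be registered here and the
        // first num_values elements inserted into the dataset.
        seen_addresses.insert(vals.data());
    }
    return seen_addresses.size();
}
```

With a buffer of 10000 elements, as in the working example, every batch of 100 to 300 values fits, and the function reports a single address for all iterations.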