c++ valgrind hdf5

Valgrind indicates function header as location of invalid write


In my application I have a function that reads data from an HDF5 file. Valgrind reports many invalid reads and writes in it, the first of which is

 ==17899== Invalid write of size 8
 ==17899==    at 0x6BD617: SPopulation<NPersAgent>::readAgentDataQDF(long, long, long) (SPopulation.cpp:1695)
 ==17899==    by 0x7A02DC: PopReader::read(PopBase*, char const*, int, bool) (PopReader.cpp:164)
 ==17899==    by 0x1B358D: SimParams::setPops(long, char const*, bool) (SimParams.cpp:1386)
 ==17899==    by 0x1B3332: SimParams::setPops(char const*) (SimParams.cpp:1351)
 ==17899==    by 0x1B2FFC: SimParams::setPopList(char*) (SimParams.cpp:1294)
 ==17899==    by 0x1B057A: SimParams::readOptions(int, char**) (SimParams.cpp:488)
 ==17899==    by 0x1A02EB: main (QHGMain.cpp:67)
 ==17899==  Address 0x1ffedb5248 is on thread 1's stack
 ==17899==  in frame #0, created by SPopulation<NPersAgent>::readAgentDataQDF(long, long, long) (SPopulation.cpp:1695)

There are more than 70 additional invalid reads/writes following, all in the same method. Some of them are at the same place (with different addresses on thread 1's stack), others are in assignment lines like hsize_t iOffset = 0; or lines with function calls like compactData();. These strange locations are listed after an invalid read located in an HDF5 function call:

herr_t status = H5Sget_simple_extent_dims(hDataSpace, &dims, NULL);

(dims has been set to 0 some lines before this)

The line "SPopulation.cpp:1695" is the head of a method:

template<typename T>
int  SPopulation<T>::readAgentDataQDF(hid_t hDataSpace, hid_t hDataSet, hid_t hAgentType) {

This completely stumps me: I don't see any reading or writing going on here...

hDataSpace and hDataSet are created from an opened HDF5 file:

hid_t hDataSet = H5Dopen2(m_hSpeciesGroup, AGENT_DATASET_NAME, H5P_DEFAULT);
hid_t hDataSpace = H5Dget_space(hDataSet);

hAgentType is created like this:

hid_t hAgentDataType = H5Tcreate (H5T_COMPOUND, agentRealSizeQDF());
T ta;
H5Tinsert(hAgentDataType, LIFE_STATE,  qoffsetof(ta, m_iLifeState), H5T_NATIVE_INT);
H5Tinsert(hAgentDataType, CELL_INDEX,  qoffsetof(ta, m_iCellIndex), H5T_NATIVE_INT);
H5Tinsert(hAgentDataType, AGENT_ID,    qoffsetof(ta, m_ulID),       H5T_NATIVE_LONG);
...

These values seem to be in order, as the program correctly reads the data from the file.

  • What does it mean when Valgrind indicates a function header as the location of an invalid write?

  • How should I interpret an invalid read reported for a simple assignment?

Edit: Here's the function body:

1694:template<typename T>
1695:int  SPopulation<T>::readAgentDataQDF(hid_t hDataSpace, hid_t hDataSet, hid_t hAgentType) {
1696:
1697:    int iResult = 0;
1698:    T aBuf[ABUFSIZE];
1699:    hsize_t dims = 0;
1700:    herr_t status = H5Sget_simple_extent_dims(hDataSpace, &dims, NULL);
1701:    hsize_t iOffset = 0;
1702:    hsize_t iCount  = 0;
1703:    hsize_t iStride = 1;
1704:    hsize_t iBlock  = 1;
1705:
1706:    compactData();
1707:
1708:    updateTotal();
1709:
1710:    while ((iResult == 0) && (dims > 0)) {
1711:        if (dims > ABUFSIZE) {
1712:            iCount = ABUFSIZE;
1713:        } else {
1714:            iCount = dims;
1715:        }
1716:
1717:        // read a buffer full
1718:        hid_t hMemSpace = H5Screate_simple (1, &iCount, NULL); 
1719:        status = H5Sselect_hyperslab(hDataSpace, H5S_SELECT_SET, 
1720:                                     &iOffset, &iStride, &iCount, &iBlock);
1721:        status = H5Dread(hDataSet, hAgentType, hMemSpace,
1722:                     hDataSpace, H5P_DEFAULT, aBuf);
1723:        if (status >= 0) {
1724:
1725:            uint iFirstIndex = m_pAgentController->reserveSpace2((uint)iCount);
1726:            m_aAgents.copyBlock(iFirstIndex, aBuf, (uint)iCount);
1727:            for (uint j =0; j < iCount; j++) {
1728:                if (aBuf[j].m_ulID > m_iMaxID) {
1729:                    m_iMaxID = aBuf[j].m_ulID;
1730:                }
1731:            }
1732: 
1733:            dims -= iCount;
1734:            iOffset += iCount; 
1735:
1736:        } else {
1737:            iResult = -1;
1738:        }
1739:    }
1740:
1741:    updateTotal();
1742:
1743:    updateNumAgentsPerCell();
1744:
1745:    return iResult;
1746:}

The lines indicated by Valgrind as erroneous are 1695, 1697, 1699, 1700, 1701, 1702, 1703, 1704, 1706, 1708, 1710, 1711, 1712, 1714, 1718, 1719, 1721, 1723, 1725, 1726, 1727, 1728, 1729, 1733, 1734, 1741, 1743, and 1745. That is, almost every single line except those containing only braces or } else {.

Edit2: code of PopReader around the call to readAgentDataQDF. Edit3: the correct fragment of PopReader:

int PopReader::read(PopBase *pPB, const char *pSpeciesName, int iNumCells, bool bRestore) {
    int iResult = -1;    

    //    printf("reading data for [%s]\n", pSpeciesName);
    m_hSpeciesGroup = qdf_openGroup(m_hPopGroup, pSpeciesName);

    if (iResult == 0) {
        // set the handles
        hid_t hAgentType = pPB->getAgentQDFDataType();
        hid_t hDataSet = H5Dopen2(m_hSpeciesGroup, AGENT_DATASET_NAME, H5P_DEFAULT);
        hid_t hDataSpace = H5Dget_space(hDataSet);
            

        iResult = pPB->readAgentDataQDF(hDataSpace, hDataSet, hAgentType);
                if (iResult == 0) {
....

 

Solution

  • OK, this is more a guess than an answer, but it wouldn't fit in a comment anyway.

    Looking at the provided code, it seems possible that these lines:

    hid_t hDataSet = H5Dopen2(m_hSpeciesGroup, AGENT_DATASET_NAME, H5P_DEFAULT);
    hid_t hDataSpace = H5Dget_space(hDataSet);
                
    
    iResult = pPB->readAgentDataQDF(hDataSpace, hDataSet, hAgentType);
    

    are compiled into something like this (an optimization that removes the unneeded locals):

    iResult = pPB->readAgentDataQDF(H5Dget_space(hDataSet), hDataSet = H5Dopen2(m_hSpeciesGroup, AGENT_DATASET_NAME, H5P_DEFAULT), hAgentType);
    

    In that case, I suspect that H5Dopen2 is doing the invalid write, and that Valgrind has no symbols for it and so cannot report a more precise location.

    So, if I were you, I would compile this code at the lowest possible optimization level with debug symbols, including for the HDF5 library (likely with the -O0 -ggdb3 flags).

    If this is not possible, add a return 0; at line 1700 and check whether Valgrind still complains. If it does not, move the return to line 1706, and so on past each function called in that method, until it complains again.

    Valgrind also has an option that lets you attach a GDB instance when such an error occurs, so you can inspect what the invalid write address refers to. You'll need a non-optimized build with debug information for this to make any sense. Run valgrind with --vgdb=full --vgdb-error=0, run gdb /path/to/your/elf/file in another window, then attach with target remote <insert what valgrind tells you here> in the GDB console.

    EDIT: I've re-read Valgrind's entire message, and I think I missed one very important potential bug:

    ==17899==  Address 0x1ffedb5248 is on thread 1's stack
    ==17899==  in frame #0, created by SPopulation<NPersAgent>::readAgentDataQDF(long, long, long) (SPopulation.cpp:1695)
    

    And you have this line in the method:

    1698:    T aBuf[ABUFSIZE];
    

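    I'm pretty sure ABUFSIZE * sizeof(T) is larger than your thread's stack size. On entering the method, the compiled code reserves stack space for all of the method's locals, including aBuf. Since the stack is too small, the frame extends past the end of the stack and you end up with a... ahem, stack overflow. That would also explain why Valgrind points at the function header and at nearly every following line: each access to a local in that frame would touch memory outside the valid stack.

    Make ABUFSIZE smaller, enlarge your thread's stack size, or allocate the buffer on the heap, and I think your issue will go away.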
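
The heap-allocation option could be sketched as follows. This is only an illustration under stated assumptions: DemoAgent and DEMO_ABUFSIZE are hypothetical stand-ins for the question's T and ABUFSIZE (their real definitions are not shown), and exceedsStackLimit and makeReadBuffer are helper names invented here. The first helper compares a buffer size against the process's soft stack limit using POSIX getrlimit; the second replaces the local array with a std::vector, whose storage lives on the heap.

```cpp
#include <sys/resource.h>  // getrlimit, RLIMIT_STACK (POSIX)
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's agent type T.
// The real struct layout is not shown in the question.
struct DemoAgent {
    int           m_iLifeState;
    int           m_iCellIndex;
    unsigned long m_ulID;
};

// Hypothetical stand-in for ABUFSIZE; the real value is not shown.
constexpr std::size_t DEMO_ABUFSIZE = 100000;

// Size in bytes of a buffer holding `count` elements of type T.
template <typename T>
constexpr std::size_t bufferBytes(std::size_t count) {
    return count * sizeof(T);
}

// Returns true if `bytes` alone is at least as large as the soft
// stack limit of the process. Returns false if the limit is
// unlimited or cannot be queried.
bool exceedsStackLimit(std::size_t bytes) {
    struct rlimit rl{};
    if (getrlimit(RLIMIT_STACK, &rl) != 0 || rl.rlim_cur == RLIM_INFINITY) {
        return false;
    }
    return bytes >= rl.rlim_cur;
}

// Heap-backed replacement for a local `T aBuf[ABUFSIZE];`.
// The elements are value-initialized; buf.data() yields the T*
// that can be passed wherever the array was passed before.
template <typename T>
std::vector<T> makeReadBuffer(std::size_t count) {
    assert(count > 0);
    return std::vector<T>(count);
}
```

Inside readAgentDataQDF, the declaration would then become something like std::vector<T> aBuf = makeReadBuffer<T>(ABUFSIZE);, with aBuf.data() passed to H5Dread and copyBlock in place of the array. Calling exceedsStackLimit(bufferBytes<T>(ABUFSIZE)) once is a quick way to check whether the original array could ever have fit on the stack.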