I am building an application on top of distributed linear algebra with Trilinos, and my main issue is that memory consumption is much higher than expected.
I have built a simple test case that assembles an Epetra_VbrMatrix holding 15 million doubles, grouped as 5 million blocks of 3 doubles each, which should take about 115 MB.
After building the matrix on 2 processors, with half of the data on each, I measure a memory consumption of 500 MB per processor, which is about 7.5 times the raw data. That looks unreasonable to me: the matrix should only need a few extra integer arrays for locating the nonzero blocks.
I asked on the trilinos-users mailing list; they agreed that the memory usage looks too high, but I am hoping for some more help here.
I tested both on my laptop (Ubuntu + gcc 4.4.5 + Trilinos 10.0) and on a cluster (PGI compiler + Trilinos 10.4.0); the results are about the same.
My test code is in this gist: https://gist.github.com/848310, where I also recorded the memory consumption at different stages of a run with 2 MPI processes on my laptop.
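For anyone who would rather not open the gist, below is a minimal sketch of the kind of assembly I am doing. This is a paraphrase, not the exact gist code; the map layout, block values, and variable names are illustrative, and error checks are omitted for brevity:

```cpp
#include <mpi.h>
#include "Epetra_MpiComm.h"
#include "Epetra_BlockMap.h"
#include "Epetra_VbrMatrix.h"

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  Epetra_MpiComm comm(MPI_COMM_WORLD);

  const int numGlobalBlockRows = 5000000;
  // Block rows of point-size 1, block columns of point-size 3,
  // distributed evenly across the MPI ranks by the maps.
  Epetra_BlockMap rowMap(numGlobalBlockRows, 1, 0, comm);
  Epetra_BlockMap colMap(numGlobalBlockRows, 3, 0, comm);

  // One block entry per block row.
  Epetra_VbrMatrix A(Copy, rowMap, 1);

  double block[3] = {1.0, 2.0, 3.0};  // a single 1x3 block, reused everywhere
  for (int i = 0; i < rowMap.NumMyElements(); ++i) {
    int globalRow = rowMap.GID(i);
    A.BeginInsertGlobalValues(globalRow, 1, &globalRow);
    A.SubmitBlockEntry(block, 1, 1, 3);  // LDA = 1, 1 row, 3 columns
    A.EndSubmitEntries();
  }
  A.FillComplete(colMap, rowMap);  // domain: size-3 blocks, range: size-1 blocks

  MPI_Finalize();
  return 0;
}
```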
Any suggestion would be really helpful. Even if you could just build, run, and report memory consumption, that would be great.
Answer by Alan Williams from the trilinos-users list; in short, VbrMatrix is not suitable for such small blocks, as the storage overhead is bigger than the data itself:
The VbrMatrix storage format definitely incurs some storage overhead, as compared to the simple number of double-precision values being stored.
In your program, you are storing 5,000,000 X 1 X 3 == 15 million doubles. With 8 bytes per double, that is 120 million bytes.
The matrix class Epetra_VbrMatrix (which is the base class for Epetra_FEVbrMatrix) internally stores an Epetra_CrsGraph object, which represents the sparsity structure. This requires a couple of integers per block-row, and 1 integer per block-nonzero. (Your case has 5 million block-rows with 1 block-nonzero per row, so at least 15 million integers in total.)
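To put rough numbers on the graph part (my own back-of-the-envelope arithmetic, assuming 4-byte integers and taking "a couple of integers per block-row" to mean two):

```cpp
// Rough estimate of the Epetra_CrsGraph footprint for this case
// (assumptions: 4-byte int, 2 ints per block-row, 1 int per block-nonzero).
const long blockRows     = 5000000;
const long blockNonzeros = 5000000;                        // 1 per block-row
const long graphInts     = 2 * blockRows + blockNonzeros;  // 15 million ints
const long graphBytes    = graphInts * 4;                  // ~60 MB, half the payload
```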
Additionally, the Epetra_VbrMatrix class stores an Epetra_SerialDenseMatrix object for each block-nonzero. This adds a couple of integers, plus a bool and a pointer, for each block-nonzero. In your case, since your block-nonzeros are small (1x3 doubles), this is a substantial overhead. The VbrMatrix format has proportionally less overhead the bigger your block-nonzeros are. But in your case, with 1x3 blocks, the VbrMatrix is indeed occupying several times more memory than is required for the 15 million doubles.
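Scaled out to 5 million blocks, that per-block bookkeeping dominates. Here is a rough model, entirely my own estimate; the per-object and allocator figures below are guesses for a typical 64-bit build, not measured Epetra internals:

```cpp
// Very rough per-block-nonzero overhead model for 5 million 1x3 blocks.
// All constants below are assumptions, not measured Epetra internals.
const long nBlocks  = 5000000;
const long payload  = nBlocks * 3 * 8;  // 120 MB of actual matrix values
const long meta     = 2 * 4 + 1 + 8;    // "a couple of ints, a bool, a pointer"
const long objGuess = 100;              // guessed size of each Epetra_SerialDenseMatrix
                                        // object plus heap-allocator bookkeeping
const long overhead = nBlocks * (meta + objGuess);  // ~585 MB of pure bookkeeping
// ~120 MB of payload versus several hundred MB of bookkeeping: the same
// order as the ~1 GB total observed across the two MPI ranks.
```

The same fixed cost amortizes over far more payload with larger blocks, which is why the format pays off only when the block-nonzeros are reasonably big.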