I am developing a recommendation engine, and I don't think I can keep the whole similarity matrix in memory. I calculated the similarities of 10,000 items, which comes to over 40 million floating-point values; stored in a binary file, that is about 160 MB.
Wow! The problem is that I could have nearly 200,000 items. Even if I cluster them into several groups and create a similarity matrix for each group, I still have to load them into memory at some point, and that will consume a lot of memory (a full 200,000 × 200,000 matrix of 4-byte floats would be roughly 160 GB).
So, is there any way to deal with this data?
How should I store it and load it into memory while keeping my engine's response to an input reasonably fast?
You could use memory mapping to access your data. This way you can view your data on disk as one big memory area (and access it just as you would access memory), with the difference that only the pages where you actually read or write data are (temporarily) loaded into memory.
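As a minimal sketch of the idea in Python with `numpy.memmap`, assuming the similarities are stored as a dense, row-major n × n float32 matrix (the file name and shape are assumptions; adjust them to your actual layout):

```python
import numpy as np

n_items = 10_000

# Map the binary file as a dense n_items x n_items float32 matrix.
# Nothing is read from disk yet; pages are loaded only when accessed.
sims = np.memmap("similarities.bin", dtype=np.float32,
                 mode="r", shape=(n_items, n_items))

# Reading one row only touches the pages backing that row (~40 KB here).
row = sims[42]
# 10 most similar items, skipping item 42 itself
# (assumes self-similarity is the largest value).
top10 = np.argsort(row)[::-1][1:11]
```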
If you can group the data somewhat, only smaller portions would have to be read into memory while accessing it.
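Building on the sketch above: if items belonging to the same cluster occupy consecutive rows in the file, you can map just that block of rows via the `offset` argument of `numpy.memmap`. The function name and file layout below are assumptions, not your actual format:

```python
import numpy as np

FLOAT_SIZE = np.dtype(np.float32).itemsize  # 4 bytes

def map_group_rows(path, first_row, n_rows, n_items):
    """Memory-map only rows [first_row, first_row + n_rows) of a dense
    row-major n_items x n_items float32 matrix stored in `path`."""
    offset = first_row * n_items * FLOAT_SIZE
    return np.memmap(path, dtype=np.float32, mode="r",
                     offset=offset, shape=(n_rows, n_items))

# e.g. a cluster of 500 items stored as rows 2000..2499
block = map_group_rows("similarities.bin", 2000, 500, 10_000)
```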
As for the floats: if you can do with less resolution and store the values as, say, 16-bit integers, that would also halve the size.
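A sketch of that, assuming the similarities lie in [-1, 1] (e.g. cosine similarity; the range is an assumption): scaling to int16 halves the storage at a resolution of about 3e-5 per value, and you just divide by the same scale factor when reading the values back:

```python
import numpy as np

SCALE = 32767  # int16 range, ~3e-5 resolution over [-1, 1]

def quantize(sims_f32):
    """float32 similarities in [-1, 1] -> int16 (half the bytes)."""
    return np.clip(np.round(sims_f32 * SCALE), -SCALE, SCALE).astype(np.int16)

def dequantize(sims_i16):
    return sims_i16.astype(np.float32) / SCALE

sims = np.array([0.913, -0.204, 0.5], dtype=np.float32)
assert np.allclose(dequantize(quantize(sims)), sims, atol=2e-5)
```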