Tags: algorithm, data-structures, save, large-data-volumes

I have 100 trillion elements, each with a size between 1 byte and 1 trillion bytes (about 0.909 TiB). How can I store and access them very efficiently?


This is an interview question:

Suppose I have 100 trillion elements, each with a size between 1 byte and 1 trillion bytes (about 0.909 TiB). How can I store them and access them very efficiently?

My thoughts: they want to test knowledge of handling large volumes of data efficiently. It is not a question with a single correct answer.

Should I save the elements in some special data structure?

Honestly, I have no idea how to approach this kind of open-ended question.

Any help is really appreciated.


Solution

  • It really depends on the data-set in question. I think the point is for you to discuss the alternatives and describe the various pros/cons.

    Perhaps you should answer their question with more questions!

    • How will it need to be accessed? (sequentially, randomly, some predictable distribution?)
    • Is the order of elements important?
    • Will the size of elements change?
    • How important is insert/remove performance?

    The data structure you choose will depend on what kinds of trade-offs you are willing to make.

    For example, if you only ever need to iterate over the set sequentially, perhaps you should use a linked list, as it has relatively small storage overhead.
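
    Since 100 trillion elements will not fit in memory, the "sequential only" case is usually realized on disk as an append-only log of length-prefixed records rather than a pointer-based list; each record's length prefix effectively tells you where the next record starts. Below is a minimal sketch of that idea; the file name and function names are chosen purely for illustration:

    ```python
    import struct

    DATA_FILE = "record_log.bin"  # hypothetical file name, used only for this sketch

    def append_record(path: str, payload: bytes) -> None:
        """Append one variable-sized element as an 8-byte length prefix followed by its bytes."""
        with open(path, "ab") as f:
            f.write(struct.pack("<Q", len(payload)))
            f.write(payload)

    def iter_records(path: str):
        """Stream the elements back in insertion order, reading one record at a time."""
        with open(path, "rb") as f:
            while True:
                header = f.read(8)
                if len(header) < 8:          # end of file reached
                    return
                (size,) = struct.unpack("<Q", header)
                yield f.read(size)

    if __name__ == "__main__":
        append_record(DATA_FILE, b"first element")
        append_record(DATA_FILE, b"second, much larger element" * 4)
        for record in iter_records(DATA_FILE):
            print(len(record), "bytes")
    ```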

    If instead you need random access, you might want to look into:

    • Hash-tables (constant time lookup, but need a good hash function for the data)
    • Some kind of index / tree structure?
    • Caching! You probably won't be able to keep it all in memory, and even if you could, you would want to take advantage of data locality where possible (see the sketch after this list).
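
    To make the random-access options above a bit more concrete, here is a rough sketch (not a production design) combining the three ideas: a hash-map index from key to (offset, size) in a data file, a single seek-and-read per lookup, and a small LRU cache for recently accessed elements. At the scale in the question the index itself would not fit in RAM and would need to be an on-disk structure such as a B-tree; all class, file, and parameter names here are assumptions for illustration:

    ```python
    from collections import OrderedDict

    class IndexedStore:
        """Append-only data file plus an index from key to (offset, size), with a small LRU cache."""

        def __init__(self, path: str, cache_capacity: int = 1024):
            self.path = path
            self.index = {}               # key -> (offset, size); at real scale this must live on disk
            self.cache = OrderedDict()    # LRU cache of recently read elements
            self.cache_capacity = cache_capacity
            open(path, "ab").close()      # make sure the data file exists

        def put(self, key: str, payload: bytes) -> None:
            """Append the element to the data file and record where it landed."""
            with open(self.path, "ab") as f:
                offset = f.tell()
                f.write(payload)
            self.index[key] = (offset, len(payload))

        def get(self, key: str) -> bytes:
            """Return the element: either a cache hit, or one index lookup plus one seek + read."""
            if key in self.cache:
                self.cache.move_to_end(key)      # keep recently used items hot
                return self.cache[key]
            offset, size = self.index[key]
            with open(self.path, "rb") as f:
                f.seek(offset)
                payload = f.read(size)
            self.cache[key] = payload
            if len(self.cache) > self.cache_capacity:
                self.cache.popitem(last=False)   # evict the least recently used element
            return payload

    if __name__ == "__main__":
        store = IndexedStore("indexed_data.bin")
        store.put("small", b"a few bytes")
        store.put("large", b"x" * 1_000_000)
        print(len(store.get("large")), "bytes")
    ```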

    TL;DR: It's all problem dependent. There are many alternatives.

    This is essentially the same problem faced by file systems / databases.
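
    In particular, a single element here can approach a terabyte, so, like a file system or database, you would never read or write it as one in-memory object; you would stream it in fixed-size blocks (pages). A rough sketch of that idea, where CHUNK_SIZE and the function name are assumptions for illustration:

    ```python
    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB blocks; the block size is an arbitrary assumption

    def copy_in_chunks(src_path: str, dst_path: str) -> list:
        """Copy an arbitrarily large element block by block, never holding it all in memory.

        Returns the offset of each block so an index (like the sketch above) could locate them.
        """
        offsets = []
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while True:
                chunk = src.read(CHUNK_SIZE)
                if not chunk:
                    break
                offsets.append(dst.tell())
                dst.write(chunk)
        return offsets
    ```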