Tags: python, list, nlp, pickle, shelve

Ways to store and access large (~10 GB) lists in Python?


I have a large set of strings that I'm using for natural language processing research, and I'd like a nice way to store it in Python.

I could use pickle, but loading the entire list into memory would be impossible (I believe), since it's about 10 GB and I don't have that much main memory. Currently I have the list stored with the shelve library... The shelf is indexed by the strings "0", "1", ..., "n", which is a bit clunky.
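Roughly what I'm doing now looks like this (the file name "strings_shelf" is just a placeholder):

    import shelve

    # Build the shelf once: keys are the decimal strings "0", "1", ..., "n-1".
    def build_shelf(strings, path="strings_shelf"):
        with shelve.open(path) as shelf:
            for i, s in enumerate(strings):
                shelf[str(i)] = s

    # Later, random-ish access: only the requested entry is loaded into memory.
    def get_string(i, path="strings_shelf"):
        with shelve.open(path) as shelf:
            return shelf[str(i)]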

Are there nicer ways to store such an object in a single file, and still have random (ish) access to it?

It may be that the best option is to split it into multiple lists.

Thanks!


Solution

  • Depending upon how you intend to get at the data, SQLite3 might be the best approach. SQLite3 is excellent at random access to relational data, but if your data is not very relational, it might not make as much sense. (Even if all you have is an 'id' number and then your string, I think SQLite3 as the underlying storage for your strings might be great; see the first sketch at the end of this answer.)

    If you can figure out some way to group your strings by how you'd use them (say, if some of your sentences have implied objects or subjects and you'd like to research those specifically, or by the source of your strings, whether formal, informal, or hyper-informal), then you could significantly reduce the 'working set' of your data by partitioning it, and potentially drastically improve the throughput of your research; see the second sketch at the end of this answer. But if you intend truly random access, then one big pile might be best.

    Hope this helps.
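    First sketch: a rough version of that id-plus-string layout using the standard-library sqlite3 module (the table and column names here are just made up for illustration):

        import sqlite3

        conn = sqlite3.connect("strings.sqlite3")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS strings (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")

        def load_strings(strings):
            # Bulk-insert; 'with conn' commits on success, rolls back on error.
            with conn:
                conn.executemany("INSERT INTO strings (body) VALUES (?)",
                                 ((s,) for s in strings))

        def get_string(string_id):
            # Pull back a single row by id without touching the rest of the file.
            row = conn.execute("SELECT body FROM strings WHERE id = ?",
                               (string_id,)).fetchone()
            return row[0] if row else None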
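    Second sketch: if you can tag your strings up front, partitioning can be as simple as one extra indexed column; a query then pulls back only the partition you need, so the working set stays small (the "source" column is hypothetical):

        import sqlite3

        conn = sqlite3.connect("strings.sqlite3")
        conn.execute("CREATE TABLE IF NOT EXISTS tagged_strings "
                     "(id INTEGER PRIMARY KEY, source TEXT, body TEXT NOT NULL)")
        conn.execute("CREATE INDEX IF NOT EXISTS idx_source ON tagged_strings (source)")

        def strings_from(source):
            # Iterate lazily: rows come back one at a time, not as one 10 GB list.
            for (body,) in conn.execute(
                    "SELECT body FROM tagged_strings WHERE source = ?", (source,)):
                yield body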