I couldn't find any resources on this topic. There are a few questions with good answers describing solutions to problems which call for data stored on disk (pickle, shelve, databases in general), but I want to learn how to implement my own.
1) If I were to create a disk based graph structure in Python, I'd have to implement the necessary methods by writing to disk. But how do I do that?
2) One of the benefits on disk based structures is having the efficiency of the structure while working with data that might not all fit on memory. If the data does not fit in memory, only some parts of it are accessed at once. How does one access only part of the structure at once?
There's quite a number of problems you have to solve, some are quite straight forward and some are a little bit more elaborate, but since you want to do it yourself I don't think you minding about filling out details yourself (so I'll skip some parts).
First simple step is to serialize and deserialize nodes (in order to be able to store on disk at all). That could be done in an ad hoc manner by having your nodes having an serialize
/deserialize
method - in addition you might want to have the serialized data to have an type indicator so you can know which class' deserialize
you should use to deserialize data. Note that on disk representation of a node must reference other nodes by file offset (either directly or indirectly).
The actual reading or writing of the data is done by ordinary (binary) file operations, but you have to seek to the right position in the file first.
Second step is to have the possibility to allocate space in the file. If you only want to have a write-once-behaviour it's quiete forward to just grow the file, but if you want to modify the data in the file (adding and removing nodes or even replacing them) you will have to cope with situation where regions in the file that are no longer in use and either reuse these or even pack the layout of the file.
Further steps could involve making the update atomic in some sense. One solution is to have a region where you write enough information so that the update can be completed (or abandoned) if it were terminated prematurely in it's most simple form it might just be a list of indempotent operations (operation that yields the same result if you repeat them, fx writing particular data to a particular place in the file).
Note that while (some of) the builtin solutions does indeed handle writing and reading the entire graph to/from disk they do not really handle the situation where you want to read only part of the graph or modifying the graph very efficient (you have to read mostly the whole graph and writing the complete graph in one go). Databases are the exception where you may read/write smaller parts of your data in a random manner.