Search code examples
databasegitdatabase-performancedatabase-replicationdocument-database

Using git repository as a database backend


I'm doing a project that deals with structured document database. I have a tree of categories (~1000 categories, up to ~50 categories on each level), each category contains several thousands (up to, say, ~10000) of structured documents. Each document is several kilobytes of data in some structured form (I'd prefer YAML, but it may just as well be JSON or XML).

Users of this systems do several types of operations:

  • retrieving of these documents by ID
  • searching for documents by some of the structured attributes inside them
  • editing documents (i.e. adding/removing/renaming/merging); each edit operation should be recorded as a transaction with some comment
  • viewing a history of recorded changes for particular document (including viewing who, when and why changed the document, getting earlier version - and probably reverting to this one if requested)

Of course, the traditional solution would be using some sort of document database (such as CouchDB or Mongo) for this problem - however, this version control (history) thing tempted me to a wild idea - why shouldn't I use git repository as a database backend for this application?

On the first glance, it could be solved like this:

  • category = directory, document = file
  • getting document by ID => changing directories + reading a file in a working copy
  • editing documents with edit comments => making commits by various users + storing commit messages
  • history => normal git log and retrieval of older transactions
  • search => that's a slightly trickier part, I guess it would require periodic export of a category into relational database with indexing of columns that we'll allow to search by

Are there any other common pitfalls in this solution? Have anyone tried to implement such backend already (i.e. for any popular frameworks - RoR, node.js, Django, CakePHP)? Does this solution have any possible implications on performance or reliability - i.e. is it proven that git would be much slower than traditional database solutions or there would be any scalability/reliability pitfalls? I presume that a cluster of such servers that push/pull each other's repository should be fairly robust & reliable.

Basically, tell me if this solution will work and why it will or won't do?


Solution

  • Answering my own question is not the best thing to do, but, as I ultimately dropped the idea, I'd like to share on the rationale that worked in my case. I'd like to emphasize that this rationale might not apply to all cases, so it's up to architect to decide.

    Generally, the first main point my question misses is that I'm dealing with multi-user system that work in parallel, concurrently, using my server with a thin client (i.e. just a web browser). This way, I have to maintain state for all of them. There are several approaches to this one, but all of them are either too hard on resources or too complex to implement (and thus kind of kill the original purpose of offloading all the hard implementation stuff to git in the first place):

    • "Blunt" approach: 1 user = 1 state = 1 full working copy of a repository that server maintains for user. Even if we're talking about fairly small document database (for example, 100s MiBs) with ~100K of users, maintaining full repository clone for all of them makes disc usage run through the roof (i.e. 100K of users times 100MiB ~ 10 TiB). What's even worse, cloning 100 MiB repository each time takes several seconds of time, even if done in fairly effective maneer (i.e. not using by git and unpacking-repacking stuff), which is non acceptable, IMO. And even worse — every edit that we apply to a main tree should be pulled to every user's repository, which is (1) resource hog, (2) might lead to unresolved edit conflicts in general case.

      Basically, it might be as bad as O(number of edits × data × number of users) in terms of disc usage, and such disc usage automatically means pretty high CPU usage.

    • "Only active users" approach: maintain working copy only for active users. This way, you generally store not a full-repo-clone-per-user, but:

      • As user logs in, you clone the repository. It takes several seconds and ~100 MiB of disc space per active user.
      • As user continues to work on the site, he works with the given working copy.
      • As user logs out, his repository clone is copied back to main repository as a branch, thus storing only his "unapplied changes", if there are any, which is fairly space-efficient.

      Thus, disc usage in this case peaks at O(number of edits × data × number of active users), which is usually ~100..1000 times less than number of total users, but it makes logging in/out more complicated and slower, as it involves cloning of a per-user branch on every login and pulling these changes back on logout or session expiration (which should be done transactionally => adds another layer of complexity). In absolute numbers, it drops 10 TiBs of disc usage down to 10..100 GiBs in my case, that might be acceptable, but, yet again, we're now talking about fairly small database of 100 MiBs.

    • "Sparse checkout" approach: making "sparse checkout" instead of full-blown repo clone per active user doesn't help a lot. It might save ~10x of disc space usage, but at expense of much higher CPU/disc load on history-involving operations, which kind of kills the purpose.

    • "Workers pool" approach: instead of doing full-blown clones every time for active person, we might keep a pool of "worker" clones, ready to be used. This way, every time a users logs in, he occupies one "worker", pulling there his branch from main repo, and, as he logs out, he frees the "worker", which does clever git hard reset to become yet again just a main repo clone, ready to be used by another user logging in. Does not help much with disc usage (it's still pretty high — only full clone per active user), but at least it makes logging in/out faster, as expense of even more complexity.

    That said, note that I intentionally calculated numbers of fairly small database and user base: 100K users, 1K active users, 100 MiBs total database + history of edits, 10 MiBs of working copy. If you'd look at more prominent crowd-sourcing projects, there are much higher numbers there:

    │              │ Users │ Active users │ DB+edits │ DB only │
    ├──────────────┼───────┼──────────────┼──────────┼─────────┤
    │ MusicBrainz  │  1.2M │     1K/week  │   30 GiB │  20 GiB │
    │ en.wikipedia │ 21.5M │   133K/month │    3 TiB │  44 GiB │
    │ OSM          │  1.7M │    21K/month │  726 GiB │ 480 GiB │
    

    Obviously, for that amounts of data/activity, this approach would be utterly unacceptable.

    Generally, it would have worked, if one could use web browser as a "thick" client, i.e. issuing git operations and storing pretty much the full checkout on client's side, not on the server's side.

    There are also other points that I've missed, but they're not that bad compared to the first one:

    • The very pattern of having "thick" user's edit state is controversial in terms of normal ORMs, such as ActiveRecord, Hibernate, DataMapper, Tower, etc.
    • As much as I've searched for, there's zero existing free codebase for doing that approach to git from popular frameworks.
    • There is at least one service that somehow manages to do that efficiently — that is obviously github — but, alas, their codebase is closed source and I strongly suspect that they do not use normal git servers / repo storage techniques inside, i.e. they basically implemented alternative "big data" git.

    So, bottom line: it is possible, but for most current usecases it won't be anywhere near the optimal solution. Rolling up your own document-edit-history-to-SQL implementation or trying to use any existing document database would be probably a better alternative.