Search code examples
hdfsnuodb

Nuodb and HDFS as storage


Using HDFS for Nuodb as storage. Would this have a performance impact?

If I understand correctly, HDFS is better suited for batch mode or write once and read many times, types of application. Would it not increase the latency for record to be fetch in case it needs to read from storage?

On top of that HDFS block size concept, keep the file size small that would increase the network traffic while data is being fetch. Am I missing something here? Please point out the same.

How would Nuodb manage these kind of latency gotchas?


Solution

  • Good afternoon,

    My name is Elisabete and I am the Technical Support Engineer over at NuoDB. I believe that I may have just answered this via your post on our own forum, but I'm responding here as well for anyone else who's curious.

    First... a mini lesson on NuoDB architecture/layout:

    The most basic NuoDB set-up includes:

    • Broker Agent
    • Transaction Engine (TE)
    • Storage Manager (SM) connected to an Archive Directory

    Broker Agents keep track of all the moving parts in the domain (collection of machines hosting NuoDB processes) and provide client applications with connection information for the next available Transaction Engine.

    Transaction Engines process incoming SQL requests and manage transactions.

    Storage Managers read and write data to and from "disk" (Archive Directory)

    All of these components can reside on a single machine, but an optimal set up would have them spread across multiple host machines (allowing each process to take full advantage of the host's available CPU/RAM). Also, while it's possible to run with just one of each component, this is a case where more is definitely more. Additional Brokers provide resiliency, additional TE's increase performance/speed and additional SM's ensure durability.

    Ok, so now lets talk about Storage:

    This is the "Archive Directory" that your storage manager is writing to. Currently, we support three modes of storage:

    • Local Files System
    • Amazon Web Services: Simple Storage volume (S3), Elastic Block Storage (EBS)
    • Hadoop Distributed Files System (HDFS)

    So, to elaborate on how NuoDB works with HDFS... it doesn't know about the multiple machines that the HDFS layer is writing to. As far as the SM is concerned, it is reading and writing data atoms to a single directory. The HDFS layer decides how to then distribute and retrieve data to and from the cluster of machines it resides over.

    And now to finally address the question of latency:

    Here's the thing, whenever we introduce a remote storage device, we inevitably introduce some amount of additional latency because the SM now has further to go when reading/writing atoms to/from memory. HDFS likely adds a bit more, because now it needs to do it's magic divvying up, distributing, retrieving and reassembling data. Add to that discrepancy in network speed, etc.

    I imagine that the gained disk space outweighs the cost in travel time, but this is something you'd have to decide on a case by case basis.

    Now, all of that said... I haven't mentioned that TE and SM's both have the ability to cache data to local memory. The size of this cache is something you can set, when starting up each process. NuoDB uses a combination of Multi-Version Concurrency Control (MVCC) and a near constant stream of communication between all of the processes, to ensure that data held in cache is kept up to date with all of the changes happening within the system. Garbage Collection also kicks in and clears out atoms in a Least Recently Used order, when the cache grows close to hitting its limit.

    All of this helps reduce latency, because the TE's can hold onto the data they reference most often and grab copies of data they don't have from sibling TE's. When they do resort to asking the SM's for data, there's a chance that the SM (or one of its sibling SM's) has a copy of the requested data in local cache, saving itself the trip out to the Archive Directory.

    Whew.. that was a lot and I absolutely glossed over more than a few concepts. These topics are covered in greater depth via the new suite of white papers (and the new "green book") available on our main website. I'm currently also working on some visual guides, to help explain all of this.

    If you'd like to know more about NuoDB or if I didn't quite answer your question.... please reach out to me directly via the NuoDB Community Forums (I respond to posts there, a bit faster).

    Thank you, Elisabete

    Technical Support Engineer at NuoDB