
Importing data from a previous version of Accumulo


I have a Docker Compose stack running instances of Accumulo 1.9.3. On another machine I have configured an identical stack, but with updated versions of the various applications, including Accumulo 2.0.1.

In the first stack, Accumulo stores data in the /data/hdfs directory. I copied its contents to the same path on the new stack and would like to import that data into Accumulo, to see if the new version 2.0.1 can interpret it correctly. At the moment the data does not seem to be picked up, because the Accumulo Monitor does not show any tables. Is there any way to make that data visible to Accumulo?


Solution

  • Apache Accumulo uses metadata to track information about the data stored in its files. It is usually not sufficient to merely copy the underlying files in HDFS over; you also need to migrate the information about those files, or copy the data in a way that makes the metadata largely irrelevant.

    I'll talk about the metadata-irrelevant situation first. Accumulo stores its files in a format called RFile (extension .rf). The client code has a bulk import API for importing a directory of such RFiles, and there is a corresponding command in the Accumulo shell. If you already have a directory full of these files, you can simply create the table in the new instance and use bulk import to add the files to your new table (a minimal sketch follows below).
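
    As a rough sketch, the bulk import call with the 2.0 client API looks roughly like the following; the instance name, ZooKeepers, credentials, table name, and staging directory are placeholder assumptions, not values from your setup:

    ```java
    import org.apache.accumulo.core.client.Accumulo;
    import org.apache.accumulo.core.client.AccumuloClient;

    public class BulkImportSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details -- substitute your own instance, ZooKeepers, and credentials
            try (AccumuloClient client = Accumulo.newClient()
                    .to("myInstance", "zookeeper1:2181")
                    .as("root", "secret")
                    .build()) {

                String table = "mytable";
                // Staging directory of RFiles on the same volume, outside Accumulo's own directories
                String stagingDir = "/tmp/bulk-import-staging";

                if (!client.tableOperations().exists(table)) {
                    client.tableOperations().create(table);
                }

                // Bulk import (2.0+ API): maps the RFiles to tablets and moves them into the table
                client.tableOperations().importDirectory(stagingDir).to(table).load();
            }
        }
    }
    ```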

    There are several pitfalls to watch out for with the bulk import method of migrating:

    1. Accumulo will move the files into its own directories for you; you should not place them there yourself. Instead, stage them in a different directory on the same volume.
    2. Be careful not to bulk load too many files at once, as the operation could stall while Accumulo processes it, and you would have to start over from the beginning if it fails. Instead, consider importing in smaller batches, depending on how many files you have.
    3. This will move the entire contents of every file in the directory into the destination table. However, these files might contain data that was deleted, or that falls outside the tablet boundaries of the original table. This can be a problem: because you haven't copied the metadata over along with the files, you could reintroduce deleted data or otherwise corrupt your table. To ensure this method works well, make sure the source table is no longer ingesting any data, effectively disable automatic splits (by configuring an absurdly large split threshold), and then flush the table and complete a full major compaction on it (see the sketch after this list). Only then can you be sure the import will work as intended, without bringing in deleted data or mangling anything due to files containing data outside their tablet's boundaries.
    4. You should also be careful not to grab any files from the table's HDFS data directories that aren't referenced in the metadata table. So, make sure you let the accumulo-gc process finish doing its collection of these files, and then list the files and compare to the entries in the metadata table to ensure that they match 1-to-1, so you're not importing any orphaned files.
    5. You should also be aware that Accumulo's bulk import is only going to move files, so if you want to abort the process and start over, you should make copies of the data you're importing first, and import from the copies, especially if you're unfamiliar with the process.
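
    To illustrate the preparation described in item 3, here is a minimal sketch using the 2.x client API; the connection details, table name, and the "100G" threshold are illustrative assumptions. On a 1.9.3 source instance, the equivalent calls exist on the older Connector-based API (or you can run the corresponding shell commands):

    ```java
    import org.apache.accumulo.core.client.Accumulo;
    import org.apache.accumulo.core.client.AccumuloClient;
    import org.apache.accumulo.core.client.admin.CompactionConfig;

    public class PrepareSourceTableSketch {
        public static void main(String[] args) throws Exception {
            try (AccumuloClient client = Accumulo.newClient()
                    .to("oldInstance", "zookeeper1:2181")  // placeholder connection details
                    .as("root", "secret")
                    .build()) {

                String table = "mytable";  // the source table being migrated (placeholder)

                // Effectively disable automatic splits by setting an absurdly large split threshold
                client.tableOperations().setProperty(table, "table.split.threshold", "100G");

                // Flush in-memory data to RFiles and wait for the flush to complete
                client.tableOperations().flush(table, null, null, true);

                // Full major compaction: rewrites each tablet's files, purging deleted entries
                client.tableOperations().compact(table, new CompactionConfig().setWait(true));
            }
        }
    }
    ```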

    Accumulo also provides a convenient export table feature that avoids a lot of this complexity. That feature creates a listing of files for you to copy from the original table, and also creates a dump of the table's metadata. There is a corresponding import table feature that helps you import the files and the metadata on the receiving end. I believe there is an Accumulo shell command for this as well. Using this feature allows you to avoid doing all the compactions and checking HDFS for orphaned files, as it gives you the list of files to copy over, and recreates the metadata for you to create the tablet boundaries. You will still need to flush the table and probably take it offline (which will ensure ingest and splitting is halted), but the command itself should check for those prerequisites to help you through the process.
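
    A minimal sketch of the export/import sequence with the client API might look like the following; the instance names, ZooKeepers, credentials, table name, and export directory are placeholder assumptions, and on the 1.9.3 source side you would use the older Connector-based API (or, I believe, the shell's exporttable command) rather than the 2.x builder shown here:

    ```java
    import org.apache.accumulo.core.client.Accumulo;
    import org.apache.accumulo.core.client.AccumuloClient;

    public class ExportImportTableSketch {
        public static void main(String[] args) throws Exception {
            // On the source instance: take the table offline, then export it
            try (AccumuloClient source = Accumulo.newClient()
                    .to("oldInstance", "zk-old:2181")   // placeholder connection details
                    .as("root", "secret")
                    .build()) {
                source.tableOperations().offline("mytable", true);
                // Writes the table metadata dump and a distcp.txt listing the files to copy
                source.tableOperations().exportTable("mytable", "/exports/mytable");
            }

            // After copying the export directory and the files listed in distcp.txt to the new cluster:
            try (AccumuloClient dest = Accumulo.newClient()
                    .to("newInstance", "zk-new:2181")   // placeholder connection details
                    .as("root", "secret")
                    .build()) {
                dest.tableOperations().importTable("mytable", "/exports/mytable");
            }
        }
    }
    ```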

    Also, please note that the export/import table feature may not work fully with multiple volumes yet in released versions of Accumulo, so you'll need to take that into consideration if that situation applies to you.

    Also, please be aware that the latest version of Accumulo as of this writing is 2.1, which is a long-term maintenance (LTM) release. The 2.0 versions are non-LTM and are not expected to receive any further updates (rather, those updates have been rolled into 2.1). So, if you're setting up a new cluster, I would strongly advise against using 2.0, which is the version in your question, and choose the latest 2.1 instead.

    If you have any follow-up questions or need help, the best places to get answers are the documentation on the Accumulo website and the mailing lists (especially the user mailing list), which you can find via the project website.