csv, bigtable, google-cloud-bigtable

Bigtable CSV import


I have a large csv dataset (>5TB) in multiple files (stored in a storage bucket) that I need to import into Google Bigtable. The files are in the format:

rowkey,s1,s2,s3,s4
text,int,int,int,int
...

There is an importtsv tool in HBase that would be perfect, but it does not seem to be available when using the Google HBase shell on Windows. Is it possible to use this tool? If not, what is the fastest way of achieving this? I have little experience with HBase and Google Cloud, so a simple example would be great. I have seen some similar examples using Dataflow but would prefer not to learn how to do that unless necessary.

Thanks


Solution

  • The ideal way to import something this large into Cloud Bigtable is to put your CSV files on Google Cloud Storage:

    • gsutil mb gs://<your-bucket-name>
    • gsutil -m cp -r <source dir> gs://<your-bucket-name>/

    Then use Cloud Dataflow.

    1. Use the HBase shell to create the table and its column family (individual column qualifiers do not need to be declared up front).

    2. Write a small Dataflow job to read all the files, build a row key from each record, and write the rows to the table. (See this example to get started; a rough sketch follows below.)
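
    Here is a rough, untested sketch of step 2 using the Apache Beam Java SDK's BigtableIO connector, rather than the exact example linked above. The project ID, instance ID, table name, bucket path, and the single column family "cf" are placeholders you would need to adjust, and it assumes the table from step 1 already exists:

      import java.util.ArrayList;
      import java.util.List;

      import com.google.bigtable.v2.Mutation;
      import com.google.protobuf.ByteString;

      import org.apache.beam.sdk.Pipeline;
      import org.apache.beam.sdk.io.TextIO;
      import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
      import org.apache.beam.sdk.options.PipelineOptionsFactory;
      import org.apache.beam.sdk.transforms.DoFn;
      import org.apache.beam.sdk.transforms.ParDo;
      import org.apache.beam.sdk.values.KV;

      public class CsvToBigtable {

        // Parses one CSV line into a (row key, mutations) pair for BigtableIO.
        static class LineToMutations extends DoFn<String, KV<ByteString, Iterable<Mutation>>> {
          @ProcessElement
          public void processElement(ProcessContext c) {
            String[] fields = c.element().split(",");
            List<Mutation> cells = new ArrayList<>();
            // First field is the row key; the rest become columns s1..s4 in family "cf".
            for (int i = 1; i < fields.length; i++) {
              cells.add(Mutation.newBuilder()
                  .setSetCell(Mutation.SetCell.newBuilder()
                      .setFamilyName("cf")
                      .setColumnQualifier(ByteString.copyFromUtf8("s" + i))
                      .setValue(ByteString.copyFromUtf8(fields[i])))
                  .build());
            }
            c.output(KV.of(ByteString.copyFromUtf8(fields[0]), cells));
          }
        }

        public static void main(String[] args) {
          Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

          p.apply("ReadCsv", TextIO.read().from("gs://<your-bucket-name>/<dir>/*"))
           .apply("ToMutations", ParDo.of(new LineToMutations()))
           .apply("WriteToBigtable", BigtableIO.write()
               .withProjectId("<your-project>")
               .withInstanceId("<your-instance>")
               .withTableId("<tablename>"));

          p.run().waitUntilFinish();
        }
      }

    Run it with the Dataflow runner and it will parallelize the read and write across the files in the bucket.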

    A somewhat easier way (note: untested) would be to:

    • Copy your files to Google Cloud Storage
    • Use Google Cloud Dataproc; the example shows how to create a cluster and hook up Cloud Bigtable.
    • SSH to your cluster master; the script in the wordcount-mapreduce example accepts ./cluster ssh.
    • Use the HBase TSV importer (ImportTsv) to start a MapReduce job; since your files are comma-separated, pass the separator explicitly and map the first column to the row key:

      hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,cf:s1,cf:s2,cf:s3,cf:s4 <tablename> gs://<your-bucket-name>/<dir>/**
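
    Depending on the ImportTsv version, the target table may need to exist before the job runs, so it is safest to create it (and the column family referenced in -Dimporttsv.columns) from the HBase shell on the cluster master first. Untested; "cf" here is just a placeholder family name that must match the column mapping above:

      hbase shell
      create '<tablename>', 'cf'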