google-cloud-platform, google-cloud-dataproc, druid

GCP Dataproc has Druid available in alpha. How to load segments?


The Dataproc page describing Druid support has no section on how to load data into the cluster. I've been trying to do this using Google Cloud Storage, but I don't know how to set up a spec for it that works. I'd expect the "firehose" section to have some Google-specific references to a bucket, but there are no examples of how to do this.

What is the method to load data into Druid running on GCP Dataproc straight out of the box?


Solution

  • I haven't used the Dataproc version of Druid, but I have a small cluster running on a Google Compute Engine VM. The way I ingest data into it from GCS is by using the Google Cloud Storage Druid extension - https://druid.apache.org/docs/latest/development/extensions-core/google.html

    To enable the extension you need to add it to the list of extensions in your Druid common.runtime.properties file:

    druid.extensions.loadList=["druid-google-extensions", "postgresql-metadata-storage"]
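
    The GCS extension authenticates with Google Application Default Credentials. As a minimal sketch, assuming you use a service account key file (on Dataproc the VM's default service account may already be enough), you would export the standard credentials variable in the environment of the Druid processes before starting them:

    # Hypothetical key file path; the google extension picks up
    # Application Default Credentials to read objects from GCS.
    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json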
    

    To ingest data from GCS I send an HTTP POST request to http://druid-overlord-host:8081/druid/indexer/v1/task

    The POST request body contains a JSON file with the ingestion spec (see the ["ioConfig"]["firehose"] section):

    {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": "daily_xport_test",
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "MONTH",
                    "queryGranularity": "NONE",
                    "rollup": false
                },
                "parser": {
                    "type": "string",
                    "parseSpec": {
                        "format": "json",
                        "timestampSpec": {
                            "column": "dateday",
                            "format": "auto"
                        },
                        "dimensionsSpec": {
                            "dimensions": [{
                                    "type": "string",
                                    "name": "id",
                                    "createBitmapIndex": true
                                },
                                {
                                    "type": "long",
                                    "name": "clicks_count_total"
                                },
                                {
                                    "type": "long",
                                    "name": "ctr"
                                },
                                "deleted",
                                "device_type",
                                "target_url"
                            ]
                        }
                    }
                }
            },
            "ioConfig": {
                "type": "index_parallel",
                "firehose": {
                    "type": "static-google-blobstore",
                    "blobs": [{
                        "bucket": "data-test",
                        "path": "/sample_data/daily_export_18092019/000000000000.json.gz"
                    }],
                    "filter": "*.json.gz$"
                },
                "appendToExisting": false
            },
            "tuningConfig": {
                "type": "index_parallel",
                "maxNumSubTasks": 1,
                "maxRowsInMemory": 1000000,
                "pushTimeout": 0,
                "maxRetry": 3,
                "taskStatusCheckPeriodMs": 1000,
                "chatHandlerTimeout": "PT10S",
                "chatHandlerNumRetries": 5
            }
        }
    }
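
    Note that the spec above uses the older firehose-based syntax. Depending on the Druid version bundled with Dataproc, the newer inputSource syntax may be expected instead; a rough equivalent of the ioConfig section (same hypothetical bucket and object path) would look like:

    "ioConfig": {
        "type": "index_parallel",
        "inputSource": {
            "type": "google",
            "uris": ["gs://data-test/sample_data/daily_export_18092019/000000000000.json.gz"]
        },
        "inputFormat": {
            "type": "json"
        },
        "appendToExisting": false
    }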
    

    Example cURL command to start an ingestion task in Druid (spec.json contains the ingestion spec shown above):

    curl -X 'POST' -H 'Content-Type:application/json' -d @spec.json http://druid-overlord-host:8081/druid/indexer/v1/task
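
    The POST returns a JSON response containing the task id. Assuming the same Overlord host and port as above, you can poll the task status endpoint to see when the ingestion finishes (TASK_ID is whatever id the submit call returned):

    curl http://druid-overlord-host:8081/druid/indexer/v1/task/TASK_ID/status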