I want to store all the data from a Kafka topic into Amazon S3. I have a Kafka cluster that receives 200,000 messages per second on one topic, and each message value has 50 fields (strings, timestamps, integers, and floats).
My main idea is to use Kafka Connect to store the data in an S3 bucket and then use AWS Glue to transform the data and write it to another bucket. I have the following questions:
1) How should I do this? Will that architecture work well? I tried Amazon EMR (Spark Streaming), but I had too many concerns: How to decrease the processing time and failed tasks using Apache Spark for events streaming from Apache Kafka?
2) I tried to use Kafka Connect from Confluent, but I have a few questions:
Can I connect to my Kafka cluster from another Kafka instance and run my S3 Kafka connector in standalone mode?
What does this error mean: "ERROR Task s3-sink-0 threw an uncaught and unrecoverable exception"?
ERROR Task s3-sink-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:142)
java.lang.NullPointerException
    at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:122)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:290)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:421)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:146)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:26,086] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:143)
[2018-10-05 15:32:27,980] WARN could not create Dir using directory from url file:/targ. skipping. (org.reflections.Reflections:104)
java.lang.NullPointerException
    at org.reflections.vfs.Vfs$DefaultUrlTypes$3.matches(Vfs.java:239)
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:98)
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:91)
    at org.reflections.Reflections.scan(Reflections.java:237)
    at org.reflections.Reflections.scan(Reflections.java:204)
    at org.reflections.Reflections.<init>(Reflections.java:129)
    at org.apache.kafka.connect.runtime.AbstractHerder.connectorPlugins(AbstractHerder.java:268)
    at org.apache.kafka.connect.runtime.AbstractHerder$1.run(AbstractHerder.java:377)
    at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:27,981] WARN could not create Vfs.Dir from url. ignoring the exception and continuing (org.reflections.Reflections:208)
org.reflections.ReflectionsException: could not create Vfs.Dir from url, no matching UrlType was found [file:/targ]
either use fromURL(final URL url, final List<UrlType> urlTypes)
or use the static setDefaultURLTypes(final List<UrlType> urlTypes) or addDefaultURLTypes(UrlType urlType) with your specialized UrlType.
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:109)
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:91)
    at org.reflections.Reflections.scan(Reflections.java:237)
    at org.reflections.Reflections.scan(Reflections.java:204)
    at org.reflections.Reflections.<init>(Reflections.java:129)
    at org.apache.kafka.connect.runtime.AbstractHerder.connectorPlugins(AbstractHerder.java:268)
    at org.apache.kafka.connect.runtime.AbstractHerder$1.run(AbstractHerder.java:377)
    at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:35,441] INFO Reflections took 12393 ms to scan 429 urls, producing 13521 keys and 95814 values (org.reflections.Reflections:229)
What are the possible values for key.converter, value.converter?
3) Once my raw data is in a bucket, I would like to use AWS Glue to take this data, deserialize the Protobuf messages, change the format of some fields, and finally store it in another bucket in Parquet. How can I use my own Java Protobuf library in AWS Glue?
4) If I want to query with Amazon Athena, how can I load the partitions (year, month, day, hour) automatically? With AWS Glue crawlers and schedules?
To complement @cricket_007's answer
Can I connect to my Kafka cluster from another Kafka instance and run my S3 Kafka connector in standalone mode?
The Kafka S3 connector is part of the Confluent distribution, which also includes Kafka as well as other related services, but it is not meant to run on your brokers directly; rather, it runs as a separate Kafka Connect worker (standalone or distributed) that consumes from your brokers over the network.
If you can summarize the steps to connect to Kafka and persist to S3 from another Kafka instance, how would you do it?
Are you talking about another Kafka Connect instance?
Or do you mean another Kafka (brokers) cluster?
If you mean another Kafka (brokers) cluster, you could try simply changing the bootstrap.servers parameter of your Connect worker configuration to point to the new cluster. Why that might work: in standalone mode, the offsets of your sink connector(s) are stored locally on your worker (contrary to distributed mode, where the offsets are stored on the Kafka cluster directly). Why that might not work: it's simply not intended for this use, and I'm guessing you might need your topics and partitions to be exactly the same.

What are the possible values for key.converter, value.converter?
Check Confluent's documentation for kafka-connect-s3 ;)
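For illustration only, here is a rough sketch of a standalone worker configuration plus an S3 sink connector configuration. The topic, bucket, region, and paths are placeholders (not values from your setup), and the class names are the converters, formats, and partitioners shipped with Kafka and the Confluent S3 connector; double-check them against that documentation.

```properties
# worker.properties (first argument to connect-standalone)
# Point this at whichever cluster you want to read from.
bootstrap.servers=broker1:9092,broker2:9092
# Converters decide how Connect (de)serializes record keys and values.
# Common choices: org.apache.kafka.connect.json.JsonConverter,
# org.apache.kafka.connect.storage.StringConverter,
# org.apache.kafka.connect.converters.ByteArrayConverter,
# io.confluent.connect.avro.AvroConverter (requires Schema Registry).
key.converter=org.apache.kafka.connect.storage.StringConverter
# ByteArrayConverter keeps the raw Protobuf bytes untouched for Glue to decode later.
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
# In standalone mode, offsets live in this local file on the worker.
offset.storage.file.filename=/tmp/connect.offsets
# Directory where the kafka-connect-s3 plugin is installed.
plugin.path=/usr/share/java

# s3-sink.properties (second argument to connect-standalone)
name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=4
topics=my-topic
s3.bucket.name=my-raw-bucket
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.bytearray.ByteArrayFormat
# Commit a file to S3 every 10000 records.
flush.size=10000
# Hourly, Hive-style prefixes so Glue/Athena can pick up year/month/day/hour partitions.
partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
partition.duration.ms=3600000
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
locale=en-US
timezone=UTC
timestamp.extractor=Record
```

You would launch this with something like connect-standalone worker.properties s3-sink.properties (connect-standalone.sh in a plain Apache Kafka distribution).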
How can I use my own Java Protobuf library in AWS Glue?
Not sure of the actual method, but Glue jobs spawn off an EMR cluster behind the scenes so I don't see why it shouldn't be possible...
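One option worth trying is to package your Protobuf-generated classes as a JAR, upload it to S3, and attach it to the Glue job through the --extra-jars special parameter. A minimal boto3 sketch, where the job, role, script, and JAR names are all hypothetical placeholders:

```python
import boto3

# Sketch only: assumes an existing IAM role and a job script already uploaded to S3.
glue = boto3.client("glue", region_name="us-east-1")

glue.create_job(
    Name="protobuf-to-parquet",
    Role="GlueServiceRole",  # an IAM role with access to the buckets involved
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-scripts/transform.py",
    },
    DefaultArguments={
        # --extra-jars adds your own JARs to the job's classpath,
        # so the script can call into your Protobuf classes.
        "--extra-jars": "s3://my-libs/my-protobuf-classes.jar",
    },
)
```

From the job script you could then deserialize each record with those classes before writing Parquet to the target bucket.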
If I want to query with Amazon Athena, how can I load the partitions (year, month, day, hour) automatically? With AWS Glue crawlers and schedules?
Yes.
Assuming daily partitioning, you could have your schedule run the crawler first thing in the morning, as soon as you can expect new data to have created that day's folder on S3 (i.e., at least one object for that day exists). The crawler will add that day's partition, which will then be available for querying along with any newly added objects.
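For example, a crawler with a daily schedule could be created along these lines (a sketch only; the crawler, role, database, bucket, and cron expression below are placeholders):

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Sketch only: assumes the S3 sink writes Hive-style year=/month=/day=/hour= prefixes.
glue.create_crawler(
    Name="raw-kafka-data-crawler",
    Role="GlueServiceRole",
    DatabaseName="kafka_raw",
    Targets={"S3Targets": [{"Path": "s3://my-raw-bucket/topics/my-topic/"}]},
    # Run every morning at 06:00 UTC, once the new day's folder should exist.
    Schedule="cron(0 6 * * ? *)",
)
```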