apache-spark apache-kudu

What does "avoid multiple Kudu clients per cluster" mean?


I am reading the Kudu documentation.

Below is an excerpt from its kudu-spark section.

https://kudu.apache.org/docs/developing.html#_avoid_multiple_kudu_clients_per_cluster

Avoid multiple Kudu clients per cluster.

One common Kudu-Spark coding error is instantiating extra KuduClient objects. In kudu-spark, a KuduClient is owned by the KuduContext. Spark application code should not create another KuduClient connecting to the same cluster. Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient.

To diagnose multiple KuduClient instances in a Spark job, look for signs in the logs of the master being overloaded by many GetTableLocations or GetTabletLocations requests coming from different clients, usually around the same time. This symptom is especially likely in Spark Streaming code, where creating a KuduClient per task will result in periodic waves of master requests from new clients.

Does this mean that I can only run one kudu-spark task at a time?

If I have a Spark Streaming program that is continuously writing data to Kudu, how can I connect to Kudu from other Spark programs?


Solution

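  • A clearer way to state "avoid multiple Kudu clients per cluster" is "avoid multiple Kudu clients per Spark application". It does not limit you to one kudu-spark job at a time: separate Spark applications can connect to the same Kudu cluster concurrently, each through its own KuduContext.

    As the documentation puts it, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient, rather than creating another KuduClient.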
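
    To illustrate the documentation's advice, here is a minimal sketch of one Spark application that creates a single KuduContext and reuses its client through syncClient. The master address "kudu-master-1:7051" and the table name "example_table" are placeholders made up for this example, and the sketch assumes the kudu-spark artifact is on the classpath and a Kudu cluster is reachable.

```scala
import org.apache.kudu.client.KuduClient
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession

object SharedKuduClientExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shared-kudu-client")
      .getOrCreate()

    // Assumption: placeholder master address; replace with your own masters.
    val kuduMasters = "kudu-master-1:7051"

    // Create one KuduContext per Spark application. It owns the KuduClient.
    val kuduContext = new KuduContext(kuduMasters, spark.sparkContext)

    // Reuse the context's client instead of building a second KuduClient
    // against the same cluster inside this application.
    val client: KuduClient = kuduContext.syncClient

    // Assumption: hypothetical table name, used only for illustration.
    val tableName = "example_table"
    if (client.tableExists(tableName)) {
      val table = client.openTable(tableName)
      println(s"$tableName has ${table.getSchema.getColumnCount} columns")
    }

    spark.stop()
  }
}
```

    A second Spark application (for example, a batch job running next to a streaming writer) would follow the same pattern with its own KuduContext. What the documentation warns against is building additional KuduClient objects inside one application, such as one per task.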