Tags: google-cloud-platform, apache-flink, google-cloud-dataproc, dataproc

OSS supported by Google Cloud Dataproc


When I go to https://cloud.google.com/dataproc, I see this ...

"Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks."

But gcloud dataproc jobs submit doesn't list all of them. It lists only 8 (hadoop, hive, pig, presto, pyspark, spark, spark-r, spark-sql). Any idea why?

~ gcloud dataproc jobs submit
ERROR: (gcloud.dataproc.jobs.submit) Command name argument expected.

Available commands for gcloud dataproc jobs submit:

      hadoop                  Submit a Hadoop job to a cluster.
      hive                    Submit a Hive job to a cluster.
      pig                     Submit a Pig job to a cluster.
      presto                  Submit a Presto job to a cluster.
      pyspark                 Submit a PySpark job to a cluster.
      spark                   Submit a Spark job to a cluster.
      spark-r                 Submit a SparkR job to a cluster.
      spark-sql               Submit a Spark SQL job to a cluster.

For detailed information on this command and its flags, run:
  gcloud dataproc jobs submit --help

Solution

  • Some OSS components are offered as Dataproc Optional Components. Not all of them have a job submit API: some (e.g., Anaconda, Jupyter) don't need one, and some (e.g., Flink, Druid) might add one in the future. You enable them at cluster creation time instead (see the cluster-creation sketch after this list).

    Some other OSS components are offered as libraries rather than job types, e.g., the GCS connector, the BigQuery connector, and Apache Parquet. You use them from within one of the existing job types (see the job-submission sketch after this list).
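
As a minimal sketch of the optional-components path, you can request components when creating the cluster with the --optional-components flag. The cluster name, region, and image version below are placeholders, and which components are accepted depends on the image version:

      # Create a cluster with the Flink and ZooKeeper optional components enabled.
      # (example-cluster, region, and image version are placeholders;
      #  available components vary by Dataproc image version)
      gcloud dataproc clusters create example-cluster \
          --region=us-central1 \
          --image-version=2.0 \
          --optional-components=FLINK,ZOOKEEPER

Components installed this way run on the cluster even though they are not all exposed through gcloud dataproc jobs submit; a Flink job, for example, might be launched from the master node (e.g., over SSH with the flink CLI) rather than through the jobs API.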
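
For the library-style components, a hedged sketch of how they are consumed: submit one of the supported job types and let it read data through the preinstalled GCS connector. The bucket, script path, cluster, and region below are placeholders:

      # Submit a PySpark job whose script reads its input via gs:// paths,
      # which the preinstalled GCS connector resolves.
      # (bucket, script, cluster, and region names are placeholders)
      gcloud dataproc jobs submit pyspark gs://example-bucket/jobs/wordcount.py \
          --cluster=example-cluster \
          --region=us-central1 \
          -- gs://example-bucket/input/*.txt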