Tags: apache-spark, hadoop, pyspark, vora

Does SAP Vora 2.1 need a Hadoop / Spark cluster? And can PySpark be used?


According to the documentation (SAP_Vora_Installation_Admin_Guide_2.0_en.pdf), it is required to have a Hadoop/Spark cluster running, as well as a Kubernetes cluster.

Now my question is: why is this Hadoop/Spark cluster needed, given that SAP Vora can read directly from HDFS, WebHDFS, and so on?

Is it just that a Spark job running on the Spark cluster can access data in HANA/Vora if it needs to? Or does Vora also use the Spark cluster to process data?

Right now it looks like Spark can use Vora, but not that Vora (the Vora UI tools such as the SQL Editor) can use Spark. The Zeppelin instance that you can attach to Vora seems to be used only for visualization (as I understand it; please correct me if I am wrong).

My second question is whether it is possible to use PySpark on the Hadoop/Spark cluster to interact with Vora, and not just Scala Spark.

Thanks in advance.


Solution

  • Yes, your assumption is correct: Spark can access Vora 2.1, but Vora 2.1 does not itself use Spark and therefore does not require a Hadoop/Spark cluster to be available. However, if you do not have Hadoop, then you must have an alternative data store to load data from, e.g. Amazon S3 or Azure Data Lake (ADL).

    Yes, it is possible to use PySpark to interact with Vora; see the sketch below.
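
As a concrete illustration of the second point, here is a minimal PySpark sketch of reading a Vora table from a Spark job. It uses only the standard `spark.read.format(...).option(...).load(...)` DataFrame API; the data source name `sap.spark.vora`, the `host`/`port` option names, and the table and endpoint names are assumptions for illustration — check the Vora Spark extension documentation shipped with your release for the exact identifiers.

```python
# Minimal PySpark sketch, assuming the Vora Spark extension JAR is on the
# classpath (e.g. passed via --jars when submitting the job) and that it
# registers a data source under the short name "sap.spark.vora".

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("vora-pyspark-example")
    .getOrCreate()
)

# Read a Vora relational table into a Spark DataFrame. The table name and
# connection options below are hypothetical placeholders.
df = (
    spark.read
    .format("sap.spark.vora")               # assumed data source name
    .option("host", "vora-tx-coordinator")  # hypothetical Vora endpoint
    .option("port", "10002")                # hypothetical port
    .load("MY_SCHEMA.SALES")
)

# From here on it is plain PySpark: the same DataFrame API that Scala
# Spark uses is available, so no Scala is required.
df.filter(df.AMOUNT > 1000).groupBy("REGION").count().show()
```

The point of the sketch is that nothing Vora-specific happens on the Python side: once the extension is on the classpath, PySpark drives it through the generic data source API exactly as Scala Spark would.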