python-3.x, pyspark, databricks, databricks-connect

Using PySpark locally when installed via databricks-connect


I have databricks-connect 6.6.0 installed, which comes with Spark 2.4.6. I have been using the Databricks cluster until now, but I am trying to switch to a local Spark session for unit testing. However, every time I run it, the job still shows up in the cluster's Spark UI as well as in the local Spark UI at xxxxxx:4040.

I have tried initializing it with SparkConf(), SparkContext(), and SQLContext(), but they all do the same thing. I have also set SPARK_HOME, HADOOP_HOME, and JAVA_HOME correctly, downloaded winutils.exe separately, and made sure none of these directories contain spaces. I have also tried running it from the console as well as from the terminal using spark-submit.

This is one of the pieces of sample code I tried:

from pyspark.sql import SparkSession

# Build (or reuse) a session with a local master, create a small DataFrame,
# then pull it back to the driver as a pandas DataFrame
spark = SparkSession.builder.master("local").appName("name").getOrCreate()
inp = spark.createDataFrame([('Person1',12),('Person2',14)],['person','age'])
op = inp.toPandas()

I am using: Windows 10, databricks-connect 6.6.0, Spark 2.4.6, JDK 1.8.0_265, Python 3.7, PyCharm Community 2020.1.1

Do I have to override the default/global Spark session to initiate a local one? How would I do that? I might be missing something; the code itself runs fine, it's just a matter of local vs. cluster.
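
For reference, this is a quick way to check which master a session actually resolved to (these two lines are not part of the sample above and assume the same spark variable; with stock pyspark they should print "local", while a databricks-connect installation may behave differently):

# Diagnostic only: show which master the session is bound to
print(spark.sparkContext.master)        # e.g. "local"
print(spark.conf.get("spark.master"))   # same setting via the runtime config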

TIA


Solution

  • You can’t run them side by side. I recommend creating two virtual environments with Conda: one for databricks-connect and one for pyspark. Then just switch between the two as needed, as sketched below.
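
For concreteness, here is a minimal sketch of what the plain-pyspark environment could look like for the unit tests (the environment itself would be created with something like conda create -n pyspark-local python=3.7 followed by pip install pyspark==2.4.6 pytest). The environment name, fixture, and test below are illustrative assumptions, not part of the original answer.

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Stock pyspark only in this environment (no databricks-connect),
    # so the session runs in-process and never contacts the cluster.
    session = (
        SparkSession.builder
        .master("local[*]")           # use all local cores
        .appName("local-unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()

def test_create_dataframe(spark):
    inp = spark.createDataFrame([('Person1', 12), ('Person2', 14)], ['person', 'age'])
    assert inp.count() == 2

To run against the cluster again, deactivate this environment and activate the databricks-connect one, rather than trying to mix the two installations in a single interpreter.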