Tags: apache-spark, spark-graphx

Apache Spark: Reading a file in standalone cluster mode


I am currently using a graph that I load from a file when I run my GraphX application locally.

I'd like to run the application in standalone cluster mode.

Do I have to make changes, such as placing the file on each cluster node? Or can I leave my application unchanged and keep the file only on the driver?

Thank you.


Solution

  • In order for the executors to access an input file, the file must be accessible from every node in the cluster.

    The preferred way is to read the file from a storage layer that all nodes can reach, e.g. HDFS or Cassandra (see the sketch after this list).

    Placing a copy of the file at the same path on every node might also work, but it isn't the recommended approach.
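
With GraphX's built-in GraphLoader, the only change is usually the path you pass in. Below is a minimal sketch in Scala; the HDFS namenode address and file paths are placeholders you would replace with your cluster's values.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

object LoadGraphFromHdfs {
  def main(args: Array[String]): Unit = {
    // No master is hardcoded; pass it via spark-submit --master
    val conf = new SparkConf().setAppName("LoadGraphFromHdfs")
    val sc = new SparkContext(conf)

    // Local path: only works in cluster mode if the file exists
    // at this exact path on EVERY worker node.
    // val graph = GraphLoader.edgeListFile(sc, "file:///data/edges.txt")

    // Shared storage: every executor reads from HDFS, so nothing
    // needs to be copied to individual nodes.
    // (namenode host/port and path below are placeholders)
    val graph = GraphLoader.edgeListFile(sc, "hdfs://namenode:9000/data/edges.txt")

    println(s"vertices: ${graph.vertices.count()}, edges: ${graph.edges.count()}")
    sc.stop()
  }
}
```

Submitted with spark-submit, the same jar then runs unchanged whether the master is `local[*]` or `spark://...`; only the input URI differs between local and cluster runs.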