Tags: apache-spark, hadoop, pyspark, hdfs, hadoop-yarn

df.show() prints an empty result while the data in HDFS is not empty


I have a PySpark application that is submitted to YARN with multiple nodes, and it also reads parquet from HDFS.

In my code, I have a DataFrame that is read directly from HDFS:

df = self.spark.read.schema(self.schema).parquet("hdfs://path/to/file")

When I call df.show(n=2) directly after the code above, it outputs:

+---------+--------------+-------+----+
|aaaaaaaaa|bbbbbbbbbbbbbb|ccccccc|dddd|
+---------+--------------+-------+----+
+---------+--------------+-------+----+

But when I manually check the HDFS path, the data is not empty.

What have I tried?

1- At first I thought I might have given my executors and driver too few cores and too little memory, so I doubled them; nothing changed.

2- Then I thought the path might be wrong, so I gave it a wrong HDFS path and it threw an error saying that the path does not exist.

What am I assuming?

1- I think this may have something to do with the driver and executors.

2- It may have something to do with YARN.

3- It may be the configs provided when using spark-submit.

Current config:

spark-submit \
    --master yarn \
    --queue my_queue_name \
    --deploy-mode cluster \
    --jars some_jars \
    --conf spark.yarn.dist.files=some_files \
    --conf spark.sql.catalogImplementation=in-memory \
    --properties-file some_zip_file \
    --py-files some_py_files \
    main.py
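One thing worth noting about this config (an aside, not from the original post): with `--deploy-mode cluster` the driver runs inside a YARN container, so `df.show()` output and any HDFS client stack traces land in the container logs rather than the local terminal. They can be retrieved with the YARN CLI; `<application_id>` is a placeholder for the id printed by spark-submit or listed by `yarn application -list`:

```shell
# Fetch the aggregated container logs for the finished application,
# which include the driver's stdout/stderr in cluster deploy mode.
yarn logs -applicationId <application_id>
```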

What I am sure of

The data is not empty. The same HDFS path is used in another project, which works fine.


Solution

  • So the problem was with the jar files I was providing.

    The Hadoop version was 2.7.2; I changed it to 3.2.0 and it works fine.