
Spark write to Parquet on HDFS


I have Hadoop and Spark installed on 3 nodes. I would like to load data from an RDBMS into a DataFrame and write that data as Parquet on HDFS. The "dfs.replication" value is 1.
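
For context, the load and write look roughly like this (a minimal sketch; the JDBC URL, table name, and credentials are placeholders, and the JDBC driver jar is assumed to be on the classpath):

// Read the source table over JDBC into a DataFrame
val xfact = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/sales")  // placeholder connection
  .option("dbtable", "xfact")                            // placeholder table
  .option("user", "etl")
  .option("password", "secret")
  .load()

// Write it out as Parquet on HDFS
xfact.write.parquet("hdfs://sparknode01.localdomain:9000/xfact")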

When I try this with the following command, I see that all HDFS blocks are located on the node where I executed spark-shell.

scala> xfact.write.parquet("hdfs://sparknode01.localdomain:9000/xfact")

Is this the intended behaviour, or should all blocks be distributed across the cluster?

Thanks


Solution

  • Since you are writing your data to HDFS, this does not depend on Spark but on HDFS. From Hadoop: The Definitive Guide:

    Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy).

    So yes, this is the intended behaviour.
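
    If you do want the blocks spread across the cluster, one option is to make sure the write itself runs as several tasks on different executors, since each writing task acts as the HDFS client for the blocks it produces. A minimal sketch, assuming you have executors on all 3 nodes (the partition count of 3 is just an assumption matching the cluster size):

    // Repartition so the write runs as multiple tasks; with tasks spread
    // across executors, the first (and here only) replica of each block
    // is placed on the node that wrote it, rather than all on one node.
    xfact.repartition(3)
      .write
      .parquet("hdfs://sparknode01.localdomain:9000/xfact")

    Note that if the JDBC read produces a single partition (the default when no partitioning column is given), the whole write is performed by one task on one node, which matches the behaviour you observed.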