apache-spark, hadoop, apache-spark-sql, hdfs, hadoop-yarn

How to find out the total size of the data read, and which data belongs to which node, in Spark


Suppose I am using Apache Spark to read a dataset like this:

City | Region | Population
A    | A1     | 150000
A    | A2     | 50000
B    | B1     | 250000
C    | C1     | 350000

After creating a DataFrame on top of this, suppose I repartition it by city. Now, if I want to know which node of my Spark cluster holds the data for city A, is it possible to find out? If yes, kindly explain how. (A sketch of the setup I mean follows.)
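For concreteness, here is a minimal sketch of the setup described above, assuming PySpark; the application name is just a placeholder and the sample values come from the table:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("city-demo").getOrCreate()

    data = [("A", "A1", 150000), ("A", "A2", 50000),
            ("B", "B1", 250000), ("C", "C1", 350000)]
    df = spark.createDataFrame(data, ["City", "Region", "Population"])

    # Repartition so that all rows with the same City value
    # hash into the same partition.
    df = df.repartition("City")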

Another question, please: how do I find out the total size of the data that Spark reads in as a DataFrame?


Solution

  • There are a couple of questions here.

    1. You want to see which data is being processed on each node.

     Each executor node only performs the operations defined in the RDD or DataFrame transformations on the chunk of data held in the partitions assigned to that executor. A quick way to see which partition each row lands in after the repartition is shown in the sketch below.
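     As a rough sketch (reusing the sample dataset from the question), the built-in spark_partition_id() function tags each row with the ID of the partition it ended up in:

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import spark_partition_id

        spark = SparkSession.builder.appName("partition-demo").getOrCreate()

        data = [("A", "A1", 150000), ("A", "A2", 50000),
                ("B", "B1", 250000), ("C", "C1", 350000)]
        df = spark.createDataFrame(data, ["City", "Region", "Population"])

        # All rows with the same City value hash to the same partition,
        # although one partition may hold several cities.
        df.repartition("City") \
          .withColumn("partition_id", spark_partition_id()) \
          .show()

     Note that this reveals partition IDs, not physical nodes; which executor hosts a given partition is decided by the scheduler at runtime.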
    

     Probably the best way to check the data on a node is to enable logging for both the driver and the executors, and to write log entries from inside the RDD/DataFrame operation. These logs are written to the local disk of each executor, so you then connect to each executor node (or open its stdout log in the Spark UI) to verify which data belongs to it; a sketch of this approach follows.
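     A minimal sketch of that idea, assuming PySpark and using the hostname as a stand-in for the node; on a real cluster the printed lines land in each executor's stdout log, viewable from the Executors tab of the Spark UI:

        import socket

        from pyspark import TaskContext
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("where-is-my-data").getOrCreate()

        data = [("A", "A1", 150000), ("A", "A2", 50000),
                ("B", "B1", 250000), ("C", "C1", 350000)]
        df = spark.createDataFrame(data, ["City", "Region", "Population"])

        def log_partition(rows):
            # Runs on the executor: gethostname() reports the node that
            # is processing this partition.
            host = socket.gethostname()
            pid = TaskContext.get().partitionId()
            for row in rows:
                print("host=%s partition=%d row=%s" % (host, pid, row))

        df.repartition("City").rdd.foreachPartition(log_partition)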

    2. If you want to know the total size of the DataFrame that was read, please refer to How to find spark RDD/Dataframe size? One approach is sketched below.
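    One way to get an estimate from PySpark is to go through Spark's internal query-execution API (assuming Spark 2.3 or later, where stats() takes no arguments); this is not a stable public interface, and the result is Catalyst's size estimate rather than an exact byte count:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("df-size-demo").getOrCreate()

        # "cities.csv" is a hypothetical input file standing in for
        # whatever source the DataFrame is read from.
        df = spark.read.csv("cities.csv", header=True, inferSchema=True)

        # Internal API: ask Catalyst for the estimated size of the
        # optimized logical plan. Subject to change between versions.
        size_in_bytes = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
        print("Estimated DataFrame size: %s bytes" % size_in_bytes)

    Alternatively, cache the DataFrame (df.cache() followed by an action such as df.count()) and read its in-memory size from the Storage tab of the Spark UI.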