Tags: apache-spark, pyspark, hive, apache-spark-sql

Does Spark bring the entire Hive table into memory?


I am learning how Apache Spark works and have some basic questions. Say I have a Spark application running that connects to a Hive table. My Hive table is as follows:

Name  Age  Marks
A     50   100
B     50   100
C     75   200

When I run the following code snippets, which rows and columns will be loaded into memory during execution? Is the filtering of rows/columns done only after the entire table has been loaded into memory?

1. spark_session.sql("SELECT name, age from table").collect()
2. spark_session.sql("SELECT * from table WHERE age=50").collect()
3. spark_session.sql("SELECT * from table").select("name", "age").collect()
4. spark_session.sql("SELECT * from table").filter("age = 50").collect()

Solution

  • If the data source supports predicate pushdown, Spark does not load the entire dataset into memory when filtering: the filter is pushed down and applied at scan time, so non-matching rows are skipped at the source.

    Let's check the Spark plan for a Hive table stored in Parquet format:

    >>> df = spark.createDataFrame([('A', 25, 100),('B', 30, 100)], ['name', 'age', 'marks'])
    >>> df.write.saveAsTable('table')
    >>> spark.sql('select * from table where age=25').explain()
    
    == Physical Plan ==
    *(1) Filter (isnotnull(age#1389L) AND (age#1389L = 25))
    +- *(1) ColumnarToRow
       +- FileScan parquet default.table[name#1388,age#1389L,marks#1390L] Batched: true, DataFilters: [isnotnull(age#1389L), (age#1389L = 25)], 
        Format: Parquet, Location: InMemoryFileIndex[file:/Users/mohan/spark-warehouse/table], 
        PartitionFilters: [], PushedFilters: [IsNotNull(age), EqualTo(age,25)], ReadSchema: struct<name:string,age:bigint,marks:bigint>
    

    You can verify whether the filter was pushed down to the underlying storage by looking at `PushedFilters: [IsNotNull(age), EqualTo(age,25)]` in the FileScan node. Similarly, `ReadSchema` shows which columns are actually read from the files: when you select only some columns (as in snippets 1 and 3), Parquet's columnar layout lets Spark prune the others, so they are never loaded either.
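
    To make the mechanics concrete, here is a toy, pure-Python sketch of what a data source supporting predicate pushdown and column pruning does. This is not Spark itself, and the `scan` function is a hypothetical stand-in for a file reader: rows failing the pushed predicate and columns outside the requested schema are dropped at scan time, so they never reach the engine's memory.

    ```python
    # Hypothetical in-memory "table" mirroring the question's data.
    TABLE = [
        {"name": "A", "age": 50, "marks": 100},
        {"name": "B", "age": 50, "marks": 100},
        {"name": "C", "age": 75, "marks": 200},
    ]

    def scan(columns=None, predicate=None):
        """Toy scan: applies the pushed-down predicate and column list
        per row, so rejected rows/columns are never materialized."""
        for row in TABLE:
            if predicate is not None and not predicate(row):
                continue  # row filtered at the source, never loaded
            yield {c: row[c] for c in (columns or row)}

    # Analogous to: SELECT name, age FROM table
    # -> only the two requested columns reach memory (column pruning)
    pruned = list(scan(columns=["name", "age"]))

    # Analogous to: SELECT * FROM table WHERE age = 50
    # -> only matching rows reach memory (predicate pushdown)
    pushed = list(scan(predicate=lambda r: r["age"] == 50))

    print(pruned)
    print(pushed)
    ```

    In the real Parquet reader the same idea goes further: row-group statistics let whole chunks of the file be skipped without being read at all.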