apache-spark, pyspark

Purpose of Spark catalog/temp views


My understanding of Spark is that, in general, you would use something like PySpark to load, for example, a CSV file into dataframes and then do various transformations and aggregations on them. If the result is a small aggregation you might pull the results back to Python, whereas if it's larger you might push the resulting dataframes to a warehouse as permanent SQL tables.

I know that Spark can also hold temporary SQL tables in its catalog, but that these temporary tables are not visible to others who might be logged into the Spark cluster.

So what I'm confused by is, why would you ever really want to create a SQL table that is stored directly in Spark? My understanding is that because it is neither an OLTP database nor a data warehouse, it would not be a good location to try to store permanent database tables. But why would you even want to create temporary tables since you already have your dataframe variables that you can work with?

I understand why it is useful to be able to interact with dataframes using SQL-like operations like .filter or .select, because it is easy if you understand SQL. But you don't need to create tables in the Spark catalog in order to use those functions. Basically what I'm asking is why would you use .createOrReplaceTempView() to move your dataframes into the catalog?


Solution

  • why would you even want to create temporary tables since you already have your dataframe variables that you can work with?

    1. As you guessed and Stephen mentioned: use in SQL. You cannot select * from a "dataframe" variable, but you can select * from a "temp view created from a dataframe".
    2. It makes programming easier.

    Imagine writing a complex pyspark program with many classes/modules/functions. If you have a few such "dataframe variables" that you want to use in 4 different functions in 2 different classes, you'll have to pass the variables around as function parameters or member variables etc.

    Instead, if you createOrReplaceTempView(), then anywhere else in the program where you want to use those "dataframe variables", you have access to the temp views by name, AND there is no overhead since a temp view is just an alias. It makes your program much cleaner, more readable, and compartmentalized (see the sketch below).

    A very crude comparison would be making a variable a class member instead of passing it around as a function parameter.
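
    A minimal sketch of that pattern (the function and view names, like load_orders, daily_revenue and orders, are made up for illustration): one function registers a temp view, and another queries it by name without any dataframe variable being passed around.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def load_orders(spark):
        # Build (or read) the data once and register it under a name
        # the rest of the program can refer to.
        df = spark.createDataFrame(
            [(1, 'book', 12.0), (2, 'pen', 1.5)],
            ['order_id', 'item', 'price'])
        df.createOrReplaceTempView('orders')

    def daily_revenue(spark):
        # No dataframe variable is passed in; the temp view is looked up by name.
        return spark.sql('select sum(price) as revenue from orders')

    load_orders(spark)
    daily_revenue(spark).show()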


    but that these temporary tables are not visible to others who might be logged into the Spark cluster.
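
    That is correct: a regular temp view is scoped to the Spark session that created it. If you need a view that other sessions in the same application can see, there is also createGlobalTempView(), which registers the view under the global_temp database; even that disappears when the application ends, so neither is a substitute for a permanent warehouse table. A quick sketch (the view name is arbitrary):

    >>> df = spark.createDataFrame([('Alice', 1)], ['name', 'age'])
    >>> df.createGlobalTempView('people')
    >>> spark.newSession().sql('select * from global_temp.people').count()   # visible from another session
    1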


    If the result is a small aggregation you might pull the results back to Python, whereas if it's larger you might push the resulting dataframes to a warehouse as permanent SQL tables.

    One thing you must always remember is that Spark is lazy.

    When Spark transforms data, it does not immediately compute the transformation but plans how to compute later. When actions such as collect() are explicitly called, the computation starts.

    It will not execute anything until you invoke an action on it. Writing a dataframe to a supported sink is of course also an action; createOrReplaceTempView() is NOT an action.

    So the onus is on YOU to decide where the result is "stored", i.e. in memory, or in some persistent sink like S3, ADLS, etc. E.g. Spark will simply crash with an OOM if you broadcast() a 1 PB dataframe on machines with insufficient memory.
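
    A minimal sketch of that laziness, assuming a hypothetical events CSV with a status column (the paths are placeholders):

    df = spark.read.csv('/data/events.csv', header=True)          # nothing is read yet
    filtered = df.filter(df.status == 'ok')                       # still only a plan
    filtered.createOrReplaceTempView('ok_events')                  # also just a plan; NOT an action

    filtered.count()                                               # action: the plan actually runs
    filtered.write.mode('overwrite').parquet('/data/ok_events')    # action: result persisted to a sink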


    Lastly, when you look under the hood, temp views are not "materialized views"; they're more like an alias:

    >>> df = spark.createDataFrame([('Alice', 1)], ['name', 'age'])
    >>> df.createOrReplaceTempView('foo9')
    >>> df2 = spark.createDataFrame([('math',),('art',)], ['subject',])
    >>> df3 = spark.sql('select * from foo9')
    >>> df4 = df2.join(df3)
    >>> df4.explain(mode='extended')
    == Parsed Logical Plan ==
    Join Inner
    :- LogicalRDD [subject#4], false
    +- Project [name#0, age#1L]
       +- SubqueryAlias foo9                           <<---
          +- View (`foo9`, [name#0,age#1L])
             +- LogicalRDD [name#0, age#1L], false
    
    == Analyzed Logical Plan ==
    subject: string, name: string, age: bigint
    Join Inner
    :- LogicalRDD [subject#4], false
    +- Project [name#0, age#1L]
       +- SubqueryAlias foo9
          +- View (`foo9`, [name#0,age#1L])
             +- LogicalRDD [name#0, age#1L], false
    
    == Optimized Logical Plan ==
    Join Inner
    :- LogicalRDD [subject#4], false
    +- LogicalRDD [name#0, age#1L], false
    
    == Physical Plan ==
    CartesianProduct
    :- *(1) Scan ExistingRDD[subject#4]
    +- *(2) Scan ExistingRDD[name#0,age#1L]
    
    >>>