Search code examples
apache-sparkapache-spark-sqlapache-spark-dataset

Ambiguity in definition of Spark "Column": What does it contain and how/when is the data used?


I'm working through this tutorial on manipulating data frame Column objects. The doc string for the Column class shows a Columns columnar data and heading (Annex A below). In contrast, the doc string for the col method shows that a Column can specify just a heading, without containing any data (Annex B below).

While a Column can result from extracting columnar data from a DataFrame, and possibly performing operations on that, in neither of the above two manifestations does a Column maintain a pointer to a source table for either the heading or data. The doc string for DataFrame's withColumn method muddies that up for me. Its prototype says to not refer to external data frames:

def withColumn(self, colName: str, col: Column) -> "DataFrame":
        <...snip...>
    The column expression must be an expression over this
    :class:`DataFrame`; attempting to add a column from some other
    :class:`DataFrame` will raise an error.
        <...snip...>

As mentioned in my 2nd paragraph above, however, Columns don't seem to refer back to any data frames for either name information or columnar data. All the information seems to be self-contained within the Column object itself.

How can I understand/resolve this apparent contradiction in what a Column is, and how should I understand withColumn's stipulation against referring to external data frames?

Further examples of the disparity can be found from other tutorials. The DataFrame df from this tutorial. can be used to show that a Column object resulting from evaluating an expression contains the columnar data, e.g., df.salary+1:

df.show(2)
+-----------------+----------+------+------+
|             name|       dob|gender|salary|
+-----------------+----------+------+------+
| {James, , Smith}|1991-04-01|     M|  3000|
|{Michael, Rose, }|2000-05-19|     M|  4000|
+-----------------+----------+------+------+

df.withColumn("salary",df.salary+1).show(2)
+-----------------+----------+------+------+
|             name|       dob|gender|salary|
+-----------------+----------+------+------+
| {James, , Smith}|1991-04-01|     M|  3001|
|{Michael, Rose, }|2000-05-19|     M|  4001|
+-----------------+----------+------+------+

In contrast, the identical DataFrame from this tutorial can be used to show a "salary" column with no data. The tutorial uses this code:

df.withColumn("salary",col("salary").cast("Integer")).show()

The key difference here is that col("salary") makes no reference to DataFrame df. In contrast to df.salary+1, therefore, the Column object that results from evaluating col("salary") cannot possibly contain any columnar data. The only possibility is that col("salary").cast("Integer") contains only the column name and (Spark) data type. This means that withColumn takes those limited details from the Column object, looks up that column in it's own DataFrame (the object through which withColumn is invoked, df), and obtains the data in that way.

Since a Column object has (at least) these two manifestations, I can picture various ways in which it might behave:

  • The Column object result from expression evaluation sometimes contains the column data
  • Sometimes, a Column object contains only the name of a column and possibly the data type. If withColumn gets such a 2nd argument (in the above context, not in its signature), perhaps it looks up the named column in in the DataFrame through which it is invoked
  • If a Column object is the result of an expression with operations, but it refers to no concrete columnar data, it is conceivable that the operations are stored in the column object symbolically so that withColumn knows what to do when it retrieves the data from the DataFrame object through which it was invoked
  • If the expression that yields a Column object refers to a column in a source DataFrame, such as in df.salary+1 above, perhaps the Column object internally maintains a reference to the source column and stores the operations in the expression symbolically. When the resulting columnar data is needed, the data is retrieved from the source column and the operations performed. This speculated behaviour allows Column objects to refer to external DataFrames; the warnings against such external references then make sense.

With the limited doc string information, there are many possibilities in what to expect of Column's implementation and behaviour. It's hard to get an understanding of the intended operation and behaviour. Hopefully, someone can shed some light and drastically reduce the possibilities. If there is more detailed documentation for Spark modules, classes, and methods, I would appreciate a pointer to it.

Thanks.

Annex A: Column class doc string excerpt

>>> df = spark.createDataFrame(
...      [(2, "Alice"), (5, "Bob")], ["age", "name"])

Select a column out of a DataFrame
>>> df.name
Column<'name'>
>>> df["name"]
Column<'name'>

Create from an expression

>>> df.age + 1
Column<...>
>>> 1 / df.age
Column<...>

Annex B: col function doc string excerpt

>>> col('x')
Column<'x'>
>>> column('x')
Column<'x'>

Solution

  • The following is not a complete answer, but it is the best answer that I could find over weeks.

    According to the comments of this answer, a Column object does not store data. It stores the column name. From the Column.cast() example in the question posted above, it may also store a data type.

    It is very possible that it can store a reference to a source DataFrame. We see evidence of this in the examples of the posted question, where df. prefixes the column name. If references to a source table cannot be stored with a Column object, then there is little need to prefix column names with DataFrame names. If a Column object can in fact store a reference to source DataFrame, then it is conceivable and plausible that specifying a source DataFrame name as a prefix allows the resulting Column object to inherit the data type from the named column in the source DataFrame. As described in the question, withColumn's warning against referencing external DataFrame's only makes sense if a Column object can contain a reference to a source DataFrame.

    Finally, as suggested in the posted question, it seems that an expression that involves column operations and yields a Column object also stores the operations symbolically. For example, df.salary+1 might store the column name salary, the source DataFrame df, and the operation +1. We know that df.salary is a Column object that stores no data, so there seems to be little logical alternative the the supposition that the column meta-data (name and possibly type) are be stored along with the source DataFrame df and the +1 operation.

    In terms of when the data is fetched and when the operations are performed, it could be that withColumn evaluates the column expression, fetches the data, and performs the operation. If there is no DataFrame referenced, then perhaps withColumn looks for an identically named column in the DataFrame object through which it was called.

    It is also possible that the fetching of data and performance of column operations is not strictly the responsibility of the withColumn method. It could be that the operator methods of Column objects (e.g., +, -, etc.) are defined such that they look up the source data, fetch it, and perform the operation corresponding to the operator.

    I see the latter as less likely because some Column objects lack a reference to a source DataFrame, so it would not be possible to "look them up". If this is a valid point, then it is more likely that data fetching and performance of operations is done by withColumn, since it has a host DataFrame through which withColumn is invoked. Hence, Column objects that lack a reference to a DataFrame, can be looked up through the host DataFrame.

    As should be clear, the answer is somewhat speculative, though carefully reasoned. It would be nice to have more transparency into what a Column object is and how it is meant to work. Short of reverse engineering the source code, if there is a better reference that describes the Column class, it could enhance this answer or provide a better alternative.