I'm working through this
tutorial
on manipulating data frame Column
objects. The doc string for the
Column
class shows a Column
s columnar data and heading (Annex A below). In
contrast, the doc string for the
col
method shows that a Column
can specify just a heading, without
containing any data (Annex B below).
While a Column
can result from extracting columnar data from a
DataFrame, and possibly performing operations on that, in neither of
the above two manifestations does a Column
maintain a pointer to a
source table for either the heading or data. The doc string for
DataFrame
's
withColumn
method muddies that up for me. Its prototype says to not refer to
external data frames:
def withColumn(self, colName: str, col: Column) -> "DataFrame":
<...snip...>
The column expression must be an expression over this
:class:`DataFrame`; attempting to add a column from some other
:class:`DataFrame` will raise an error.
<...snip...>
As mentioned in my 2nd paragraph above, however, Column
s don't
seem to refer back to any data frames for either name information or
columnar data. All the information seems to be self-contained within
the Column
object itself.
How can I understand/resolve this apparent contradiction in what a
Column
is, and how should I understand withColumn
's stipulation
against referring to external data frames?
Further examples of the disparity can be found from other tutorials.
The DataFrame
df
from this
tutorial.
can be used to show that a Column
object resulting from evaluating
an expression contains the columnar data, e.g., df.salary+1
:
df.show(2)
+-----------------+----------+------+------+
| name| dob|gender|salary|
+-----------------+----------+------+------+
| {James, , Smith}|1991-04-01| M| 3000|
|{Michael, Rose, }|2000-05-19| M| 4000|
+-----------------+----------+------+------+
df.withColumn("salary",df.salary+1).show(2)
+-----------------+----------+------+------+
| name| dob|gender|salary|
+-----------------+----------+------+------+
| {James, , Smith}|1991-04-01| M| 3001|
|{Michael, Rose, }|2000-05-19| M| 4001|
+-----------------+----------+------+------+
In contrast, the identical DataFrame
from this
tutorial can
be used to show a "salary" column with no data. The tutorial uses
this code:
df.withColumn("salary",col("salary").cast("Integer")).show()
The key difference here is that col("salary")
makes no reference
to DataFrame
df
. In contrast to df.salary+1
, therefore, the
Column
object that results from evaluating col("salary")
cannot
possibly contain any columnar data. The only possibility is that
col("salary").cast("Integer")
contains only the column name and
(Spark) data type. This means that withColumn
takes those limited
details from the Column object, looks up that column in it's own
DataFrame (the object through which withColumn
is invoked, df
),
and obtains the data in that way.
Since a Column
object has (at least) these two manifestations, I
can picture various ways in which it might behave:
Column
object result from expression evaluation sometimes
contains the column dataColumn
object contains only the name of a column
and possibly the data type. If withColumn
gets such a 2nd
argument (in the above context, not in its signature), perhaps it
looks up the named column in in the DataFrame
through which it is
invokedColumn
object is the result of an expression with
operations, but it refers to no concrete columnar data, it is
conceivable that the operations are stored in the column object
symbolically so that withColumn
knows what to do when it
retrieves the data from the DataFrame
object through which it was
invokedColumn
object refers to a column
in a source DataFrame
, such as in df.salary+1
above, perhaps
the Column
object internally maintains a reference to the source
column and stores the operations in the expression symbolically.
When the resulting columnar data is needed, the data is retrieved
from the source column and the operations performed. This
speculated behaviour allows Column
objects to refer to external
DataFrame
s; the warnings against such external references then
make sense.With the limited doc string information, there are many possibilities
in what to expect of Column
's implementation and behaviour.
It's hard to get an understanding of the intended operation and behaviour. Hopefully, someone can shed some light and drastically reduce the
possibilities. If there is more detailed documentation for Spark modules, classes, and methods, I would appreciate a pointer to it.
Thanks.
Column
class doc string excerpt>>> df = spark.createDataFrame(
... [(2, "Alice"), (5, "Bob")], ["age", "name"])
Select a column out of a DataFrame
>>> df.name
Column<'name'>
>>> df["name"]
Column<'name'>
Create from an expression
>>> df.age + 1
Column<...>
>>> 1 / df.age
Column<...>
col
function doc string excerpt>>> col('x')
Column<'x'>
>>> column('x')
Column<'x'>
The following is not a complete answer, but it is the best answer that I could find over weeks.
According to the comments of this
answer, a Column
object does not store data. It stores the column name. From the
Column.cast()
example in the question posted above, it may also store a
data type.
It is very possible that it can store a reference to a source
DataFrame
. We see evidence of this in the examples of the posted
question, where df.
prefixes the column name. If references to a
source table cannot be stored with a Column
object, then there is
little need to prefix column names with DataFrame
names. If a
Column
object can in fact store a reference to source DataFrame
, then it is
conceivable and plausible that specifying a source DataFrame
name as a
prefix allows the resulting Column
object to inherit the data type
from the named column in the source DataFrame
. As described in the
question, withColumn
's warning against referencing external
DataFrame
's only makes sense if a Column
object can contain
a reference to a source DataFrame
.
Finally, as suggested in the posted question, it seems that an
expression that involves column operations and yields a Column
object also stores the operations symbolically. For example,
df.salary+1
might store the column name salary
, the source
DataFrame
df
, and the operation +1
. We know that df.salary
is a
Column
object that stores no data, so there seems to be little
logical alternative the the supposition that the column meta-data
(name and possibly type) are be stored along with the source
DataFrame
df
and the +1
operation.
In terms of when the data is fetched and when the operations are
performed, it could be that withColumn
evaluates the column expression, fetches
the data, and performs the operation. If there is no DataFrame
referenced, then perhaps withColumn
looks for an identically named
column in the DataFrame
object through which it was called.
It is also possible that the fetching of data and performance of
column operations is not strictly the responsibility of the
withColumn
method. It could be that the operator methods of
Column
objects (e.g., +
, -
, etc.) are defined such that they look up the
source data, fetch it, and perform the operation corresponding to the
operator.
I see the latter as less likely because some Column
objects lack a
reference to a source DataFrame
, so it would not be possible to "look
them up". If this is a valid point, then it is more likely that data
fetching and performance of operations is done by withColumn
, since
it has a host DataFrame
through which withColumn
is invoked. Hence,
Column
objects that lack a reference to a DataFrame
, can be looked up
through the host DataFrame
.
As should be clear, the answer is somewhat speculative, though
carefully reasoned. It would be nice to have more transparency into
what a Column
object is and how it is meant to work. Short of reverse
engineering the source code, if there is a better reference that
describes the Column
class, it could enhance this answer or provide a
better alternative.