Currently, Spark has two implementations for a row:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
```
Why does Spark need both of them? Do they represent the same encoded entities, with one used internally (internal APIs) and the other used with the external APIs?
`Row`: externally facing and immutable. It is used by developers when interacting with DataFrames and Datasets.

`InternalRow`: internally facing, mutable, and performance-optimized. It is used by Spark's engine for executing and optimizing queries.

Yes, you are correct that both `Row` and `InternalRow` serve similar purposes in representing a row of data. Still, they are designed for different use cases and environments within Spark.
Why Does Spark Have Both `Row` and `InternalRow`?
`Row`:

- `Row` is designed to be part of the public API, meaning it is intended for external use by developers when interacting with DataFrames and Datasets.
- `Row` is immutable, making it safe to use in parallel processing without concerns about accidentally modifying data.

`InternalRow`:
- `InternalRow` is designed for internal use by Spark's Catalyst optimizer and query execution engine. It is optimized for performance, particularly memory usage and access speed.
- Unlike `Row`, `InternalRow` is mutable, allowing Spark to modify row data in place during execution. This mutability is critical for internal operations, such as query optimization and evaluation, where data must be adjusted on the fly.

External API (`Row`):
- `Row` provides a user-friendly interface, making it straightforward to work with structured data.
- `Row` is used in user-facing operations like querying, displaying, and transforming data.

Internal API (`InternalRow`):
- `InternalRow` is designed for internal processes where speed and memory efficiency are critical.
- `InternalRow` is used in operations like logical and physical query plan execution, where the overhead of a more user-friendly interface would be detrimental to performance.

Do They Represent the Same Encoded Entities?
`Row` and `InternalRow` ultimately represent the same concept: a row of data in a structured format (like a DataFrame). However, they are optimized for different purposes:

- `Row` is optimized for usability and safety in the context of public APIs.
- `InternalRow` is optimized for performance and flexibility within Spark's internal execution engine.
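To make the mutability contrast concrete, here is a sketch using `GenericInternalRow`, one of Spark's `InternalRow` implementations. This is internal API, so it may change between Spark versions; note that strings are stored internally as `UTF8String`, not `java.lang.String`:

```scala
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.unsafe.types.UTF8String

// Allocate an internal row with two slots. Strings must be
// converted to Spark's internal UTF8String representation.
val internal = new GenericInternalRow(2)
internal.update(0, UTF8String.fromString("Alice"))
internal.setInt(1, 30)

// Unlike Row, an InternalRow can be modified in place, which is
// what lets the execution engine reuse row objects during a scan.
internal.setInt(1, 31)
```

This in-place update is exactly what the public `Row` API forbids, and it is why `InternalRow` instances should never be cached or handed out across operator boundaries without copying.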