Search code examples
apache-sparkapache-spark-sqlapache-spark-dataset

Differences between Spark's Row and InternalRow types


Currently Spark has two implementations for Row:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow

What is the need to have both of them? Do they represent the same encoded entities but one used internally (internal APIs) and the other is used with the external APIs?


Solution

  • TLDR:

    • Row: Externally-facing, and immutable. They are used by developers when interacting with DataFrames and Datasets.
    • InternalRow: Internally-facing, mutable, and performance-optimized and used by Spark's engine for executing and optimizing queries.

    In-detailed

    Yes, you are correct that both Row and InternalRow serve similar purposes in representing a row of data. Still, they are designed for different use cases and environments within Spark.

    Why Does Spark Have Both Row and InternalRow?

    1. Separation of Concerns:

    • Row:

      • User-Facing API: Row is designed to be part of the public API, meaning it is intended for external use by developers when interacting with DataFrames and Datasets.
      • Immutability: Row is immutable, making it safe to use in parallel processing without concerns about accidentally modifying data.
    • InternalRow:

      • Internal API: InternalRow is designed for internal use by Spark's Catalyst optimizer and query execution engine. It is optimized for performance, particularly in memory usage and access speed.
      • Mutability: Unlike Row, InternalRow is mutable, allowing Spark to modify row data in place during execution. This mutability is critical for internal operations, such as query optimization and evaluation, where data must be adjusted on the fly.

    2. Different Use Cases:

    • External API (Row):

      • Users who work with DataFrames and Datasets need a row representation that is easy to understand and manipulate. Row provides this interface, making it straightforward to work with structured data.
      • Row is used in user-facing operations like querying, displaying, and transforming data.
    • Internal API (InternalRow):

      • Inside Spark, particularly within the Catalyst optimizer and execution engine, performance is paramount. InternalRow is designed for these internal processes where speed and memory efficiency are critical.
      • InternalRow is used in operations like logical and physical query plan execution, where the overhead of a more user-friendly interface would be detrimental to performance.

    Do They Represent the Same Encoded Entities?

    • Yes, but with Different Objectives: Both Row and InternalRow ultimately represent the same concept: a row of data in a structured format (like a DataFrame). However, they are optimized for different purposes:
      • Row is optimized for usability and safety in the context of public APIs.
      • InternalRow is optimized for performance and flexibility within Spark's internal execution engine.