Trino Delta Lake connector: querying the change data feed entries of a table


Starting from version 408, Trino supports creating tables with the change_data_feed_enabled table property. I am using Trino version 413.

I already have a Delta table with data in AWS S3, built with PySpark with change data feed enabled. When I create the table through the Trino Delta Lake connector, I set this property to true. When I query the table, it returns the latest image of each record by its defined key. There seems to be no difference compared to setting this property to false.

If you use Spark SQL to read the Delta table, you can set the read option readChangeFeed to control whether the result is the latest image or the full historical change log of the data.

How can I write the SQL statement in Trino so that I get the same kind of read control as setting readChangeFeed in PySpark?

Example of how I create a table through the Trino Delta Lake connector:

CREATE TABLE delta.table_collection.table_name (
    id varchar,
    value_1 varchar,
    value_2 integer,
    log_status varchar,
    ts bigint
) WITH (
    location = 's3://path/to/table',
    checkpoint_interval = 7,
    change_data_feed_enabled = true
);

Solution

  • Starting from version 419, Trino can read the change data feed (CDF) of a Delta table. You can check it here.
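
As a rough sketch of what reading the CDF can look like, the Delta Lake connector exposes a table_changes table function in the catalog's system schema. The query below reuses the catalog, schema, and table names from the question. The since_version value of 0 is only an assumed starting point for illustration, so adjust it to the table version you want to read changes from, and check the connector documentation for your Trino version before relying on it.

```sql
-- Sketch only: read change data feed entries instead of the latest snapshot.
-- Names come from the question; since_version => 0 is an assumed starting version.
SELECT *
FROM TABLE(
    delta.system.table_changes(
        schema_name => 'table_collection',
        table_name => 'table_name',
        since_version => 0
    )
);
```

Besides the table's own columns, the result should carry the change metadata columns _change_type, _commit_version, and _commit_timestamp, which play a role similar to what readChangeFeed returns in Spark. A plain SELECT against delta.table_collection.table_name still returns only the latest image of each record.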