sql relational-algebra database-optimization database-theory

Can modern SQL syntax be translated into a relational algebra tree?

I have a parsed representation of an SQL query in the form of an abstract syntax tree. I'd like to convert it into an intermediate logical tree representation, which can in turn be expanded into candidate physical operation trees by the optimizer.

Following various resources discussing the topic (CMU lecture, Database Management System, Dayal's 1987 paper on nested subqueries), I managed to translate most of the query into an intermediate algebraic form (e.g. π_a(R⋈S)) but I get stuck at translating common table expressions and lateral joins.

I can't find a way to include CTEs in the algebraic tree without inlining all references to them. While running down the algebraic tree, the optimizer would come across the same expanded CTE, which seems inefficient.
By definition, subqueries in lateral joins can reference columns provided by preceding FROM items. When using a relational algebra tree representation, FROM items preceding the subquery are parent nodes, but join nodes can only reference columns output by child nodes. For instance:

SELECT *
FROM
  (SELECT 1 id) A,
  (
    (SELECT 2) B
    JOIN LATERAL (SELECT A.id) C ON TRUE
  ) D;

    NAT. JOIN
        ├── A
        └── LATERAL JOIN (D)
            ├── B
            └── C <- should be able to reference A

Can these constraints be expressed in terms of algebraic operations or do they typically need bespoke data structures? For example, separate subtrees for CTEs instead of a single tree for the whole logical plan - but then, how can the optimizer optimize the query globally instead of each subtree individually?

More generally, are there better intermediate representations than algebraic trees for the purpose of query optimization?

Solution

While algebraic trees are quite common as intermediate representations, they may not capture all the complexities of modern SQL queries. Using more flexible representations such as DAGs, augmenting trees with annotations, and maintaining logical/physical separation can help address the challenges posed e.g. by CTEs and lateral joins.

For instance:

with Directed Acyclic Graphs: each node in the DAG represents an operation, and edges represent data flow. This allows for more complex relationships between operations, such as lateral joins referencing columns from preceding FROM items.
Relational Operator Trees with Annotations: Augmenting the relational operator tree with annotations or metadata can help capture additional information needed for optimization, such as context for lateral joins, bindings, or the presence of CTEs.

At to inlining CTEs vs. using subtrees: inlining CTEs into the algebraic tree can lead to redundancy and inefficiency, as you've rightly pointed out. However, in some cases, inlining might be the simplest approach for optimization. For more complex queries or for optimization across multiple queries, maintaining a separate representation for CTEs might be beneficial. This could involve storing the CTEs separately and referencing them as needed in the main query plan.

For further reading on the topic and in addition to the sources you already mentioned, I would recommend "System R: Relational Approach to Database Management" by Chamberlin, D.D., et al. which discusses the System R project and its internals, "Volcano: An Extensible and Parallel Query Evaluation System" by Graefe, G., as well as Postgres' parser and optimiser implementations.