Search code examples
postgresqlmemory-managementdatabase-designdatabase-performance

Shared Scanning in Postgres


In the 11th lecture of the CMU Intro to Databases course (2020, 39:37), Andy Pavlo states that "only the high end data systems support shared buffer scanning but Postgres and MySql cannot". He does not expand and thus, I tried to find out why but couldn't find any abstracted information and wanted to ask here before I dove into the documentation. Did Andy mean that Postgres cannot support this due to its implementation, or has it simply not been implemented yet?

If it cannot be implemented, what about the Postgres design prevents it from doing so? How can this be circumvented? If it is possible, what is preventing the implementation today? Thanks in advance.


Solution

  • Listening to the talk, he says something like:

    If we do a merge join, we have got to sort the tables. Now if we detect that two queries want to sort the same data at the same time, it would be cool if the queries could piggy-back onto each other. The high-end systems can do that, but Postgres and MySQL cannot.

    That's only partly true.

    It is true that each backend (each query) that wants to sort has to do so on its own, and there is no way of sharing sorted results.

    But I don't think that would be a very valuable feature:

    Any two queries will likely see different versions of the data (imagine a row inserted between the start of the two queries), so they couldn't share the result anyway. So this could only be used if two queries want to sort the exact same set of rows in exactly the same way at approximately the same time, which seems like too much of a corner case to add a complicated feature for. Sharing data between PostgreSQL backends is difficult because of the multi-process architecture of PostgreSQL.

    But what PostgreSQL can do (and here the speaker is wrong) is to have two queries share a sequential scan of the same table: if you leave synchronize_seqscans at its default value of on, a second query that wants to scan the same table as an already running query will just piggy-back on to the running sequential scan. That is easier, because the data are in shared_buffers, which is a shared resource. This feature reduces I/O if you have many concurrent sequential scans of the same table.