Search code examples
monetdb

See number of live/dead tuples in MonetDB


I'm trying to get some precise row counts for all tables, given that some have deleted rows. I have been using sys.storage.count. But this seems to count the deleted ones also.

I assume using sys.storage would be simpler and faster than looping through count(*) queries, though both strategies may be fine in practice.

Maybe there is some column that counts modifications so I could just subtract the two counts?


Solution

  • If all you need to know is the number of actual rows in a table, I'd recommend just using a count(*) query. It's very fast. Even if you have N tables, it's easy to do a count(*) for each table.

    sys.storage gives you information from the raw storage. With that, you can get pretty low-level information, but it has some edges. sys.storage.count returns the count in the storage, hence, indeed, it includes the delete rows since they are not actually deleted. As of Jul2021 version of MonetDB, deleted rows are automatically overwritten by new inserts (i.e. auto-vacuuming). So, to get the actual row count, you need to look up the 'deletes' from sys.deltas('<schema>', '<table>'). For instance:

    sql>create table tbl (id int, city string);
    operation successful
    sql>insert into tbl values (1, 'London'), (2, 'Paris'), (3, 'Barcelona');
    3 affected rows
    sql>select * from tbl;
    +------+-----------+
    | id   | city      |
    +======+===========+
    |    1 | London    |
    |    2 | Paris     |
    |    3 | Barcelona |
    +------+-----------+
    3 tuples
    sql>select schema, table, column, count from sys.storage where table='tbl';
    +--------+-------+--------+-------+
    | schema | table | column | count |
    +========+=======+========+=======+
    | sys    | tbl   | city   |     3 |
    | sys    | tbl   | id     |     3 |
    +--------+-------+--------+-------+
    2 tuples
    sql>select id, deletes from sys.deltas ('sys', 'tbl');
    +-------+---------+
    | id    | deletes |
    +=======+=========+
    | 15569 |       0 |
    | 15570 |       0 |
    +-------+---------+
    2 tuples
    

    After we delete one row, the actual row count is sys.storage.count - sys.deltas ('sys', 'tbl').deletes:

    sql>delete from tbl where id = 2;
    1 affected row
    sql>select * from tbl;
    +------+-----------+
    | id   | city      |
    +======+===========+
    |    1 | London    |
    |    3 | Barcelona |
    +------+-----------+
    2 tuples
    sql>select schema, table, column, count from sys.storage where table='tbl';
    +--------+-------+--------+-------+
    | schema | table | column | count |
    +========+=======+========+=======+
    | sys    | tbl   | city   |     3 |
    | sys    | tbl   | id     |     3 |
    +--------+-------+--------+-------+
    2 tuples
    sql>select id, deletes from sys.deltas ('sys', 'tbl');
    +-------+---------+
    | id    | deletes |
    +=======+=========+
    | 15569 |       1 |
    | 15570 |       1 |
    +-------+---------+
    2 tuples
    

    After we insert a new row, the deleted row is overwritten:

    sql>insert into tbl values (4, 'Praag');
    1 affected row
    sql>select * from tbl;
    +------+-----------+
    | id   | city      |
    +======+===========+
    |    1 | London    |
    |    4 | Praag     |
    |    3 | Barcelona |
    +------+-----------+
    3 tuples
    sql>select schema, table, column, count from sys.storage where table='tbl';
    +--------+-------+--------+-------+
    | schema | table | column | count |
    +========+=======+========+=======+
    | sys    | tbl   | city   |     3 |
    | sys    | tbl   | id     |     3 |
    +--------+-------+--------+-------+
    2 tuples
    sql>select id, deletes from sys.deltas ('sys', 'tbl');
    +-------+---------+
    | id    | deletes |
    +=======+=========+
    | 15569 |       0 |
    | 15570 |       0 |
    +-------+---------+
    2 tuples
    

    So, the formula to compute the actual row count (sys.storage.count - sys.deltas ('sys', 'tbl').deletes) is generally applicable. sys.deltas() keeps stats for every column of a table, but the count and deletes are table wide, so you only need to check one column.