Search code examples
postgresqlcountrowverificationdatabase-table

How to check content loading status on a static database?


We have a static database we constantly update with loader scripts. These loader scripts get current information from third party sources, clean it and upload it to database.

I have already made some SQL scripts to ensure schemas and tables required exists. Now I'd like to check that each table has the expected row count.

I did something like this:

select case when count(*) = <someNumber>
  then 'someSchema.someTable OK'
  else 'someSchema.someTable BAD row count' end
from someSchema.someTable;

But doing these kind of queries for ~300 tables is cumbersome.

Now I was thinking maybe there's a way to have a table like:

create table expected_row_count (
  schema_name varchar,
  table_name varchar,
  row_count bigint
);

And somehow test all listed tables and only output the ones that fail the count check. But I'm kind of missing now... Should I try to write a function? Can a table like this be used to build queries and execute them?


Solution

  • Whole credit goes to @a-horse_with*_no_name , I'm posting a reply for completeness:

    Check row count

    First let's create some data to test the query:

    create schema if not exists data;
    
    create table if not exists data.test1 (nothing int);
    create table if not exists data.test2 (nothing int);
    
    insert into data.test1 (nothing)
      (select random() from generate_series(1, 28));
    
    insert into data.test2 (nothing)
      (select random() from generate_series(1, 55));
    
    create table if not exists public.expected_row_count (
      table_schema varchar not null default '',
      table_name varchar not null default '',
      row_count bigint not null default 0
    );
    
    insert into public.expected_row_count (table_schema, table_name, row_count) values
      ('data', 'test1', (select count(*) from data.test1)),
      ('data', 'test2', (select count(*) from data.test2))
      ;
    

    Now the query to check the data:

    select * from (
      select
        table_schema,
        table_name,
        (xpath('/row/cnt/text()', xml_count))[1]::text::int as row_count
      from (
        select
          table_schema,
          table_name,
          query_to_xml(format('select count(*) as cnt from %I.%I', table_schema, table_name), false, true, '') as xml_count
        from information_schema.tables
        where table_schema = 'data' --<< change here for the schema you want
      ) infs ) as r
    
    inner join expected_row_count erc
      on r.table_schema = erc.table_schema
      and r.table_name = erc.table_name
      and r.row_count != erc.row_count
     ;
    

    Previous query should give an empty results if all counts are ok, and the tables with missing data if not. To check it, update the count for some table on expected_row_count and re-run the query. For example:

    update expected_row_count set row_count = 666 where table_name = 'test1';