Tags: database, postgresql, database-design, database-optimization

Optimisation of multiple left-joined tables for full-text search over multiple columns, with offset, limit and count for pagination


I currently have four tables that need to be joined with a left join. There are about 400,000 records, and this number will grow in the future.

In addition to the left join, the following operations must perform well:

  • A full-text search across the columns purchasing.mandant_id, purchasing.company_code, purchasing_mandant_names.name, company_codes.name, purchasing.sequence_number, purchasing.year and purchasing.month.
  • The full-text search must also match partial words and ignore upper and lower case.
  • A count of the rows found (to calculate how many pages there are).
  • An offset and limit (or an alternative that works essentially the same way) so that a fixed page number can be requested for each search (no cursor functionality; we do not know the id of the last row found).
  • We also cannot assume that the next requested page is the next consecutive number; a much higher page number can be requested immediately.

I have had several ideas myself for optimising these requirements.

full-text search:

  • a separate table in which the columns for the full-text search are concatenated into a single text field (p.mandant_id::text || ' '::text || pmn.name::text || ...), stored together with the purchasing id and indexed with a GIN index (gin_trgm_ops); see the sketch after this list
  • a completely different system, such as Sonic or Elasticsearch, just for the full-text search; but I don't know how best to synchronise the data with it and query it from the application
  • something along the lines of ZomboDB; unfortunately it has the disadvantage that deleted records are not removed from Elasticsearch, and it only runs on a Linux server, whereas I prefer and use Windows
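
A minimal sketch of the first idea (table, column and index names here are placeholders; it assumes the pg_trgm extension is available):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE purchasing_search (
    purchasing_id int4 PRIMARY KEY REFERENCES purchasing(id) ON DELETE CASCADE,
    search_text text NOT NULL
);

CREATE INDEX purchasing_search_text_trgm_idx
    ON purchasing_search USING gin (search_text gin_trgm_ops);

-- filled once here; it would have to be kept in sync by triggers or a batch job
INSERT INTO purchasing_search (purchasing_id, search_text)
SELECT p.id,
       concat_ws(' ', p.mandant_id, pmn.name, p.company_code, cc.name,
                 p.sequence_number, p.year, p.month)
  FROM purchasing p
  LEFT JOIN purchasing_mandant_names pmn ON p.mandant_id = pmn.id
  LEFT JOIN company_codes cc ON p.company_code = cc.id;

-- the search then becomes a single indexable, case-insensitive match
SELECT purchasing_id FROM purchasing_search WHERE search_text ILIKE '%32%';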

offset, limit:

  • limit the search to purchasing.id (select only purchasing.id from the left join) and fetch all remaining data for those ids afterwards (whether this really is an optimisation is not certain); a sketch follows this list
  • add another column that always contains the current row number, with an index; it would have to be rebuilt completely after every deletion, and unfortunately it does not really help in general, because as soon as the data is filtered the row numbers are no longer contiguous
  • the same applies to the id, with the addition that deletions can also leave gaps in the id sequence
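
A sketch of the first idea, keeping the current schema (whether it actually helps would have to be measured):

WITH page_ids AS (
    SELECT p.id
      FROM purchasing p
      LEFT JOIN purchasing_mandant_names pmn ON p.mandant_id = pmn.id
      LEFT JOIN company_codes cc ON p.company_code = cc.id
     WHERE p.mandant_id::text LIKE '%32%'
        OR pmn.name ILIKE '%32%'
        -- ... same conditions for the remaining columns ...
     ORDER BY p.id DESC
     OFFSET 10 LIMIT 10
)
SELECT p.id, pmn.name AS mandant_name, cc.name AS company_name,
       p.sequence_number, p.year, p.month, pp.page_path
  FROM page_ids
  JOIN purchasing p ON p.id = page_ids.id
  LEFT JOIN purchasing_pages pp ON p.id = pp.purchasing_id
  LEFT JOIN purchasing_mandant_names pmn ON p.mandant_id = pmn.id
  LEFT JOIN company_codes cc ON p.company_code = cc.id
 ORDER BY p.id DESC;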

count:

  • I can't think of a real optimisation here, because the entire data set always has to pass through the filter. The only current option is to use the id-only search to return just the ids and count them; those ids can then be used to fetch the rest of the required data. A sketch follows below.
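
Counting only the ids of the filtered set, without the purchasing_pages join (which would otherwise count one row per page_path):

SELECT count(*) AS total_rows
  FROM (SELECT p.id
          FROM purchasing p
          LEFT JOIN purchasing_mandant_names pmn ON p.mandant_id = pmn.id
          LEFT JOIN company_codes cc ON p.company_code = cc.id
         WHERE p.mandant_id::text LIKE '%32%'
            OR pmn.name ILIKE '%32%'
            -- ... same conditions for the remaining columns ...
       ) AS filtered;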

Adding helper columns or tables is possible. Changing the structure is also possible, as long as all data is retained.

It is also possible to change the database system if Postgres does not offer the required features, as long as the alternative is:

  • relational
  • open source
  • usable from both PHP and Node.js

the tables:

CREATE TABLE company_codes (
    id int2 NOT NULL,
    name varchar(20) NOT NULL,
    CONSTRAINT company_codes_pk PRIMARY KEY (id)
);

CREATE TABLE purchasing_mandant_names (
    id int2 NOT NULL,
    "name" varchar(50) NOT NULL,
    CONSTRAINT archiving_coding_keys_pkey PRIMARY KEY (id),
    CONSTRAINT purchasing_mandant_names_un UNIQUE (name)
);

CREATE TABLE purchasing (
    id int4 NOT NULL GENERATED BY DEFAULT AS IDENTITY,
    mandant_id int2 NOT NULL,
    company_code int4 NOT NULL,
    sequence_number int8 NOT NULL,
    "year" int2 NOT NULL,
    "month" int2 NOT NULL,
    CONSTRAINT purchasing_check_month CHECK (((month >= 1) AND (month <= 12))),
    CONSTRAINT purchasing_check_year CHECK (((year >= 2000) AND (year <= 2100))),
    CONSTRAINT purchasing_client_plant_sequence_number_key UNIQUE (mandant_id, company_code, sequence_number),
    CONSTRAINT purchasing_pkey PRIMARY KEY (id)
);

ALTER TABLE purchasing ADD CONSTRAINT purchasing_company_code_fk FOREIGN KEY (company_code) REFERENCES company_codes(id);
ALTER TABLE purchasing ADD CONSTRAINT purchasing_mandant_id_fkey FOREIGN KEY (mandant_id) REFERENCES purchasing_mandant_names(id) NOT VALID;

CREATE TABLE purchasing_pages (
    purchasing_id int8 NOT NULL,
    page_path varchar(255) NOT NULL,
    CONSTRAINT purchasing_pages_check_page_path CHECK (((page_path)::text ~ '^([a-zA-Z0-9\-_ ]+\/)+[a-zA-Z0-9\-_ ~]+\.[tT][iI][fF]'::text)),
    CONSTRAINT purchasing_pages_purchasing_id_page_path_key UNIQUE (purchasing_id, page_path)
);
ALTER TABLE purchasing_pages ADD CONSTRAINT purchasing_pages_id_fkey FOREIGN KEY (purchasing_id) REFERENCES purchasing(id) ON UPDATE CASCADE ON DELETE RESTRICT;

the current slow search:

SELECT p.id,
    p.mandant_id AS mandant_code,
    pmn.name AS mandant_name,
    p.company_code,
    cc.name AS company_name,
    p.sequence_number,
    p.year,
    p.month,
    pp.page_path
   FROM purchasing p
     LEFT JOIN purchasing_pages pp ON p.id = pp.purchasing_id
     LEFT JOIN purchasing_mandant_names pmn ON p.mandant_id = pmn.id
     LEFT JOIN company_codes cc ON p.company_code = cc.id
       where p.mandant_id::text like '%32%'
        or pmn.name::text like '%32%'
        or p.company_code::text like '%32%'
        or cc.name::text like '%32%'
        or p.sequence_number::text like '%32%'
        or p.year::text like '%32%'
        or p.month::text like '%32%'
     order by p.id desc
     offset 10 fetch first 10 rows only

ER diagram (image omitted)


Solution

  • My solution (I think there are better options, but these are the only ones I found myself):

    full-text search:

    I created a GIN index with gin_trgm_ops for each column (a sketch follows). I was also interested in Wildspeed, but unfortunately that extension is no longer supported. Wildspeed would also have had the advantage of handling search terms shorter than three characters, which trigram indexes cannot use. Maybe I will look in this direction again in the future to see whether a similar extension exists.
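
    What I built looks roughly like this (index names chosen here for illustration; the casts in the index expressions must match the casts used in the query, and pg_trgm is assumed to be installed):

    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    CREATE INDEX purchasing_mandant_id_trgm_idx
        ON purchasing USING gin ((mandant_id::text) gin_trgm_ops);
    CREATE INDEX purchasing_sequence_number_trgm_idx
        ON purchasing USING gin ((sequence_number::text) gin_trgm_ops);
    CREATE INDEX purchasing_mandant_names_name_trgm_idx
        ON purchasing_mandant_names USING gin (name gin_trgm_ops);
    -- ... analogous indexes for company_code, year, month and company_codes.name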

    offset, limit:

    Unfortunately, there is nothing to optimise once a filter is applied. To improve performance, I built several options into the application itself. If the user follows a direct link to a page, offset/limit must be used, because there is no other possibility. In addition to that path, the application now passes the difference between the new offset and the previous offset together with the last seen id, so that the index is at least used a little during step-by-step navigation (sketch below). For a direct jump into filtered data, however, there are no optimisation possibilities, because the data has to be filtered anyway.
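
    A sketch of the hybrid navigation; :last_seen_id and :offset_delta are placeholders supplied by the application (:offset_delta being the new offset minus the number of rows already consumed):

    SELECT p.id, pmn.name AS mandant_name, cc.name AS company_name,
           p.sequence_number, p.year, p.month, pp.page_path
      FROM purchasing p
      LEFT JOIN purchasing_pages pp ON p.id = pp.purchasing_id
      LEFT JOIN purchasing_mandant_names pmn ON p.mandant_id = pmn.id
      LEFT JOIN company_codes cc ON p.company_code = cc.id
     WHERE p.id < :last_seen_id          -- seek past the rows already shown
       AND (p.mandant_id::text LIKE '%32%'
            OR pmn.name ILIKE '%32%'
            -- ... same conditions for the remaining columns ...
           )
     ORDER BY p.id DESC
     OFFSET :offset_delta                -- skip only the remaining difference
     LIMIT 10;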

    count:

    I use count(*) OVER () directly in combination with the offset/limit search filter, so only one query needs to be executed. It is the only optimisation I have found for this; a sketch follows.
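
    The window function is evaluated after the WHERE filter but before the LIMIT, so every returned row carries the total number of filtered rows:

    SELECT p.id,
           p.mandant_id AS mandant_code,
           pmn.name AS mandant_name,
           p.company_code,
           cc.name AS company_name,
           p.sequence_number,
           p.year,
           p.month,
           pp.page_path,
           count(*) OVER () AS total_rows  -- identical in every returned row
      FROM purchasing p
      LEFT JOIN purchasing_pages pp ON p.id = pp.purchasing_id
      LEFT JOIN purchasing_mandant_names pmn ON p.mandant_id = pmn.id
      LEFT JOIN company_codes cc ON p.company_code = cc.id
     WHERE p.mandant_id::text LIKE '%32%'
        OR pmn.name ILIKE '%32%'
        -- ... same conditions for the remaining columns ...
     ORDER BY p.id DESC
     OFFSET 10 FETCH FIRST 10 ROWS ONLY;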