Search code examples
databasepostgresqldatabase-performance

How to store 300M records in Postgresql to run efficiency queries


I have following table:

CREATE TABLE public.shop_prices
(
    shop_name text COLLATE pg_catalog."default",
    product text COLLATE pg_catalog."default",
    product_category text COLLATE pg_catalog."default",
    price text COLLATE pg_catalog."default"
)

and for this table i have a dataset from 18 months. In each file there are about 15M records. I have to some analysis, like in which month a shop has increased or decreased their price. I imported two months in a table and run following query just to test:

select shop, product from shop_prices group by shop, product limit 10

I waited more than 5 minutes, but no any result and response. It was still on working. What is the best way the store these datasets and run efficiency queries? Is it a good idea if I create for each dataset a seperate tables?


Solution

  • Using explain analyze select shop_name, product from shop_prices group by shop, product limit 10 you can see how Postgres is planning and executing the query and the time the execution takes. You'll see it needs to read the whole table (with the time consuming disk reads) and then sort it in memory - which will probably need to be cached on disk, before returning the results. In the next run you might discover the same query is very snappy if the number of shop_name+product combinations are very limited and thus stored in pg_stats after that explain analyze. The point being that a simple query like this can be deceiving.

    You will faster execution by creating an index on the columns you are using (create index shop_prices_shop_prod_idx on public.shop_prices(shop_name,product)).

    You should definitely change the price column type to numeric (or float/float8)) if you plan to do any numerical calculations on it.

    Having said all that, I suspect this table is not what you will be using as it does not have any timestamp to compare prices between months to begin with.

    I suggest you complete the table design and speculate on indices to improve performance. You might even want consider table partitioning https://www.postgresql.org/docs/current/ddl-partitioning.html

    You will probably be doing all sorts of queries on this data so there is no simple solution to them all.

    By all means return with perhaps more specific questions with complete table description and the output from the explain analyze statement for queries you are trying out and get some good advice.

    Best regards,
    Bjarni