
Should I partition my postgres tables?


I am storing some stock data in a Postgres 9.5 database. My tables have the following format, with a primary key on (datetime, symbol):

symbol (varchar(30)),
datetime (timestamptz),
value (double precision) 

Right now, some of my bigger tables are over 80 million rows and, as a result, some queries are a bit slower than I would like. 99% of my queries involve grabbing a day's worth of data for a specific symbol, something like:

SELECT * FROM "prices"
WHERE symbol = 'AAPL' AND datetime BETWEEN '2016-07-22 9:30' AND '2016-07-22 16:30'

On an average day I will insert 250k rows, but on some spiky days it can go as high as 500k rows. This is over the 6.5 hours that the American market is open.

I was reading into partitioning and thinking about doing it monthly (at an average of 20 trading days in a month, I should have between 5 and 10 million rows per partition).

I am not experienced with databases, and what I have set up so far is the work of a novice. As I look into writing automatic partitioning scripts, it seems they all contain some kind of query to check whether the partition already exists. The following is taken from a much longer function at https://blog.engineyard.com/2013/scaling-postgresql-performance-table-partitioning:

PERFORM 1
FROM   pg_catalog.pg_class c
JOIN   pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE  c.relkind = 'r'
AND    c.relname = _tablename
AND    n.nspname = 'myschema';

I don't really want to be checking whether a table exists 250k to 500k times per day, so I was thinking I could just run a cron job that creates the table once a month and skip the checks altogether?

I was also thinking that maybe I should just insert all of my data into a temporary table and then run a cron job after the market closes at 4:30pm as no data gets inserted once the market closes. The thinking being that my cron job would empty the temp table and put everything into the correct partition.

Is this worth doing? Should I look into sharding instead? The database server is an i7 with 6 cores, 64 GB of RAM, and data stored on SSDs. I could spin up more Postgres instances, but I don't have access to more servers, so they would all have to live on the same machine.

Also, while I am here, are there any recommendations on indexes I should have on my tables to make queries like the one above go faster?

Thanks!


Solution

  • Is partitioning possible in your scenario?

    In PostgreSQL 9.5, partitioning is built on table inheritance. Before you proceed any further you need to be aware of this limitation of inheritance:

    A serious limitation of the inheritance feature is that indexes (including unique constraints) and foreign key constraints only apply to single tables, not to their inheritance children. This is true on both the referencing and referenced sides of a foreign key constraint.

    If you are fortunate enough not to have foreign keys, then yes, partitioning can be used.

    Partitioning

    I don't know what you have been reading online, but partitioning is straightforward if you read the official documentation. That is true for new tables, at least; for existing tables it's a bit trickier. My strategy for dealing with an existing table is something like this (it may not be the best):

    1) CREATE TABLE newtable (LIKE oldtable INCLUDING ALL);
    2) create the monthly partitions (child tables) on newtable
    3) move the data into newtable
    4) drop the old table and replace it with a view
    5) create triggers to make the view writable (this is pretty standard; you will find examples in the docs)

    Don't create too many partitions; one per month is reasonable.
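
    A minimal sketch of the inheritance setup for the prices table from the question follows; the prices_partitioned name, the single month shown, and the routing function are illustrative placeholders rather than a finished migration script:

    -- parent table with the same shape as the old one
    CREATE TABLE prices_partitioned (LIKE prices INCLUDING ALL);

    -- one child per month, with a CHECK constraint so the planner can skip it
    CREATE TABLE prices_2016_07 (
        CHECK (datetime >= '2016-07-01' AND datetime < '2016-08-01')
    ) INHERITS (prices_partitioned);
    ALTER TABLE prices_2016_07 ADD PRIMARY KEY (datetime, symbol);

    -- route inserts on the parent into the matching child
    CREATE OR REPLACE FUNCTION prices_insert_trigger() RETURNS trigger AS $$
    BEGIN
        IF NEW.datetime >= '2016-07-01' AND NEW.datetime < '2016-08-01' THEN
            INSERT INTO prices_2016_07 VALUES (NEW.*);
        ELSE
            RAISE EXCEPTION 'No partition for datetime %', NEW.datetime;
        END IF;
        RETURN NULL;  -- the row has already been stored in the child
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER prices_partition_insert
        BEFORE INSERT ON prices_partitioned
        FOR EACH ROW EXECUTE PROCEDURE prices_insert_trigger();

    With constraint_exclusion at its default setting of partition, a query restricted to one day should only scan the child whose CHECK constraint covers that date.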

    Normalize

    This is not an alternative to partitioning; it is something you should seriously think about. You have a symbol varchar(30) column, so why not create a table called symbols and store each distinct symbol there once? The prices table then only needs an int field. Assuming an average symbol length of 10, you will be shaving about 6 bytes per record in the table. The index will also shrink, and that helps.
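
    A sketch of that normalization (the symbols table, the symbol_id column and the foreign key are suggestions; keep the inheritance caveat above in mind before relying on foreign keys):

    CREATE TABLE symbols (
        symbol_id serial PRIMARY KEY,
        symbol    varchar(30) NOT NULL UNIQUE
    );

    -- target shape of the prices table: a 4-byte id instead of a varchar(30)
    CREATE TABLE prices (
        symbol_id int NOT NULL REFERENCES symbols (symbol_id),
        datetime  timestamptz NOT NULL,
        value     double precision NOT NULL,
        PRIMARY KEY (datetime, symbol_id)
    );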

    Double to int

    Another optimization is to switch from double precision to int, which will save another 4 bytes per row (you will have to store the prices in minor currency units, e.g. cents). With this and the above, we've shaved at least 30% off the table's size!
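
    For example, assuming the normalized shape above and a hypothetical price_cents integer column in place of the double precision value column:

    -- writing: store a price of 123.45 as the integer 12345
    INSERT INTO prices (symbol_id, datetime, price_cents)
    VALUES (1, now(), round(123.45 * 100)::int);

    -- reading: convert back to a decimal price
    SELECT datetime, price_cents / 100.0 AS price
    FROM prices
    WHERE symbol_id = 1;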

    Manual table splitting

    I was also thinking that maybe I should just insert all of my data into a temporary table and then run a cron job after the market closes at 4:30pm as no data gets inserted once the market closes.

    Almost but not quite. Insert everything into the main table. After the market closes, use a cron job to move data older than 30 days into an archive table.
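
    A sketch of that nightly move, runnable from cron via psql; prices_archive is an assumed table with the same columns as prices:

    -- move rows older than 30 days from the main table into the archive in one statement
    WITH moved AS (
        DELETE FROM prices
        WHERE datetime < now() - interval '30 days'
        RETURNING *
    )
    INSERT INTO prices_archive
    SELECT * FROM moved;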

    Summary Table

    When tracking long-term trends you only need Open, High, Low and Close; you don't need the intraday fluctuations. Create a summary table with this data (populated by your cron job) and use it for trend analysis. Use the main table only for intraday or short-term trends.
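
    A sketch of such a summary table and the query the cron job could run after the close, written against the original prices schema from the question (the table and column names here are assumptions):

    CREATE TABLE daily_summary (
        symbol     varchar(30) NOT NULL,
        trade_date date        NOT NULL,
        open       double precision,
        high       double precision,
        low        double precision,
        close      double precision,
        PRIMARY KEY (symbol, trade_date)
    );

    -- one row per symbol for today's session
    INSERT INTO daily_summary (symbol, trade_date, open, high, low, close)
    SELECT DISTINCT ON (symbol)
           symbol,
           current_date,
           first_value(value) OVER w,
           max(value)         OVER w,
           min(value)         OVER w,
           last_value(value)  OVER w
    FROM prices
    WHERE datetime >= current_date AND datetime < current_date + 1
    WINDOW w AS (PARTITION BY symbol ORDER BY datetime
                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);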