My application needs to store millions of binary blocks in a Postgres database (a few thousand of them might arrive per second). Most of the blocks are 16K in size, though some might be smaller. I understand that I can use `text`, `bytea`, or `blob` columns, or I can store the binary data in files outside the database and put their paths in the table.

Considering that high write throughput is my most important goal, which option is the most appropriate for my situation?
`bytea` is the sensible option here - almost the only option.
There is no advantage to using `text`, `varchar`, etc. Don't store encoded binary in them; that's an option you should immediately disregard.
There is no `blob` type in PostgreSQL. I think you might mean `lob`, which is a wrapper for `oid` used for looking up "large objects" in the `pg_largeobject` table. This is useful when you need virtual "files" in the database where you can seek, overwrite, append, etc., but it's not at all suited to your use case.
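For illustration, this is roughly what the large-object interface looks like through PgJDBC's `LargeObjectManager`. It behaves like a file handle (create, seek, partial overwrite), which is exactly the extra machinery you don't need for fixed 16K blocks. This is a sketch with an assumed open connection, not a recommendation:

```java
import java.sql.Connection;
import org.postgresql.PGConnection;
import org.postgresql.largeobject.LargeObject;
import org.postgresql.largeobject.LargeObjectManager;

class LargeObjectDemo {
    // Sketch only: 'conn' is an assumed open JDBC connection to PostgreSQL.
    static long writeVirtualFile(Connection conn) throws Exception {
        conn.setAutoCommit(false); // the large-object API must run in a transaction
        LargeObjectManager lom = conn.unwrap(PGConnection.class).getLargeObjectAPI();

        long oid = lom.createLO(LargeObjectManager.READWRITE); // a new "virtual file"
        LargeObject lo = lom.open(oid, LargeObjectManager.WRITE);
        lo.write(new byte[16 * 1024]);  // append some data...
        lo.seek(0);                     // ...then seek back and overwrite in place
        lo.write(new byte[]{1, 2, 3});
        lo.close();
        conn.commit();
        return oid; // the oid is what you'd store in your own table
    }
}
```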
You could store paths or filenames and look the data up externally, but you're going to have a lot of very small files. You're also going to need a side channel for clients to read and write them, since you can't use the PostgreSQL protocol directly. You'll need to handle backup/restore and replication for them separately. They won't get deleted if a transaction rolls back or if the corresponding database tuple gets deleted, so you'll need a cleanup system to remove files that are no longer needed. It'll get messy. This approach is worth taking when the files are big, long-lived, and mostly static, but it doesn't sound like that's the case for you.
Store the binary data in `bytea` columns directly, and preferably use the binary protocol support in PgJDBC or libpq to exchange the `bytea` values between client and server without any encoding overhead. Have minimal indexes on the table you write to. (Under some circumstances you can even go without defining a primary key, but that's something of an expert-level option.) If you don't mind losing the data in the table on an unplanned restart, use an unlogged table. Otherwise, batch your writes and use asynchronous commits and/or a commit delay.
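As a concrete sketch of those suggestions, here's what batched binary inserts could look like with PgJDBC, assuming a hypothetical `blocks` table and connection details (the same approach works with libpq's binary parameter functions). `UNLOGGED` and `synchronous_commit = off` are alternative trade-offs shown together here purely for illustration; pick whichever suits your durability needs:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class BlockWriter {
    public static void main(String[] args) throws Exception {
        // binaryTransfer=true asks PgJDBC to use the binary wire format where
        // possible, so bytea parameters aren't hex/escape-encoded in transit.
        String url = "jdbc:postgresql://localhost:5432/mydb?binaryTransfer=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "pass")) {
            try (Statement st = conn.createStatement()) {
                // UNLOGGED skips WAL, which is much faster, but the table is
                // emptied after a crash - only use it if that's acceptable.
                st.execute("CREATE UNLOGGED TABLE IF NOT EXISTS blocks ("
                        + "id bigserial PRIMARY KEY, data bytea NOT NULL)");
                // Alternative: keep the table logged but allow a small window
                // of lost (never corrupted) commits on crash.
                st.execute("SET synchronous_commit = off");
            }

            conn.setAutoCommit(false); // batch many rows per transaction
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO blocks (data) VALUES (?)")) {
                byte[] block = new byte[16 * 1024]; // stand-in for a real block
                for (int i = 0; i < 1000; i++) {
                    ps.setBytes(1, block); // sent as binary, no text encoding
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            conn.commit(); // one commit for the whole batch
        }
    }
}
```

Adding `reWriteBatchedInserts=true` to the JDBC URL lets PgJDBC rewrite the batch into multi-row inserts, which can improve throughput further.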
See also: *How to speed up insertion performance in PostgreSQL*.