Say I am building a camera app. Every time the user clicks a photo, the image is stored in the cloud. Because I want to restrict how many images are stored in the cloud, the app gets 10 URLs in an array called `listURLs` when it is initialised. The first 10 clicks get `PUT` into the cloud, exhausting `listURLs`. Then, every time a click happens, a coin toss determines whether the latest click replaces an existing click in the cloud. Typical numbers would be 50 clicks: the first 10 clicks get assigned a URL, and of the remaining 40 clicks, 20 of them overwrite an existing URL.

I store records of each app session in a Postgres DB. Each session has an ID and all instances of clicks (which may or may not have a corresponding URL). I also need to know the `url` corresponding to each click, if one exists. So, if there are 30 clicks, I need to know which 10 of them have a corresponding `url`.
I can think of two ways of storing this data.

1. A single table `tblClicksURLs` with fields `click_id`, `url`, and `url_active`. Every time a `click_id` and non-null `url` need to be inserted, update all other records with the same `url` to have `url_active` set to `false`, then insert the new record.
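For concreteness, Case 1 might look something like this (the column types and example values are my assumptions, not part of the original design):

```sql
-- Hypothetical schema for Case 1 (single table)
CREATE TABLE tblClicksURLs (
    click_id   bigint PRIMARY KEY,
    url        text,
    url_active boolean
);

-- Writing a click that carries a URL: deactivate any earlier
-- rows holding the same URL, then insert the new row.
BEGIN;
UPDATE tblClicksURLs
   SET url_active = false
 WHERE url = 'https://bucket/img7.jpg'
   AND url_active;
INSERT INTO tblClicksURLs (click_id, url, url_active)
VALUES (42, 'https://bucket/img7.jpg', true);
COMMIT;
```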
2. Two tables, `tblClicks` and `tblURLs`, where `tblURLs` has a `click_id` foreign key. Every time a `click_id` and non-null `url` need to be inserted, the `click_id` gets inserted into `tblClicks`, and the `(click_id, url)` pair gets upserted into `tblURLs`. The upsert is based on whether the `url` already exists in `tblURLs`, so for a given `url` there will only ever be one `click_id` in `tblURLs`.
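Case 2 might be sketched like this (again, types and example values are illustrative; making `url` the primary key of `tblURLs` is my reading of "one `click_id` per `url`"):

```sql
-- Hypothetical schema for Case 2 (two tables)
CREATE TABLE tblClicks (
    click_id bigint PRIMARY KEY
);
CREATE TABLE tblURLs (
    url      text PRIMARY KEY,
    click_id bigint NOT NULL REFERENCES tblClicks (click_id)
);

-- Writing a click that carries a URL: insert the click, then
-- upsert the (url, click_id) pair so each URL points at the
-- most recent click that used it.
BEGIN;
INSERT INTO tblClicks (click_id) VALUES (42);
INSERT INTO tblURLs (url, click_id)
VALUES ('https://bucket/img7.jpg', 42)
ON CONFLICT (url) DO UPDATE
    SET click_id = EXCLUDED.click_id;
COMMIT;
```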
So, in Case 1 I will have an `UPDATE` of `url_active` followed by an `INSERT` on the same table. In Case 2 I will have an `INSERT` into one table and an `UPSERT` into another. I will need an index on `click_id`, but not on `url`.
If you are looking at writes of > 10k rows per second, maybe even more, which of these two would be more efficient? Assume that the numbers per session are similar to the ones quoted above (50 clicks, etc.).
I could also register a `created_at` datetime for each record in Case 1, and just take the first non-null `url` when ordering reverse-chronologically. But I am trying to avoid this unless the performance benefits are enormous.
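For reference, the `created_at` variant would resolve the "current" click per URL at read time rather than on write; something along these lines (`DISTINCT ON` is Postgres-specific, and the schema is assumed from Case 1):

```sql
-- Latest click for each URL, recovered from created_at:
-- DISTINCT ON keeps the first row per url after ordering,
-- i.e. the most recently created one.
SELECT DISTINCT ON (url) click_id, url, created_at
  FROM tblClicksURLs
 WHERE url IS NOT NULL
 ORDER BY url, created_at DESC;
```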
After thinking about it for a while, I decided to use an UPSERT along with a unique constraint on `tblClicksURLs`. The constraint is on `(url, url_active)`. Every time a new `(click_id, url, url_active)` record needs to be added, this record is UPSERT'ed. On conflict, `url_active` is set to `null`, so all records with this `url` end up with `url_active` set to `null`.
I then use `RETURNING (xmax = 0) AS inserted`, as discussed here, to check whether the record was inserted or upserted. If the record was upserted, I update the `url_active` field of the latest record to `true`.

In this way, if a `url` is being inserted for the first time, it happens in a single transaction. If the `url` already exists, it happens via two transactions.
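A sketch of the statements involved (types and values are illustrative, and the follow-up step is paraphrased from my description above, so treat this as an outline rather than the exact code). Note that the constraint relies on Postgres treating NULLs as never conflicting, so any number of `(url, NULL)` rows can coexist with a single active one:

```sql
-- Unique constraint that only "sees" non-null url_active:
-- NULLs never conflict in a Postgres unique constraint.
ALTER TABLE tblClicksURLs
    ADD CONSTRAINT uq_url_active UNIQUE (url, url_active);

-- Attempt the insert. On conflict, ON CONFLICT DO UPDATE
-- modifies the existing conflicting row (nulling its
-- url_active); the new row is NOT inserted in that case.
-- (xmax = 0) is true only for a freshly inserted row.
INSERT INTO tblClicksURLs (click_id, url, url_active)
VALUES (42, 'https://bucket/img7.jpg', true)
ON CONFLICT (url, url_active) DO UPDATE
    SET url_active = NULL
RETURNING (xmax = 0) AS inserted;

-- If inserted came back false, a second statement is needed
-- to write the latest click with url_active = true; the
-- old active row has already been nulled, so it no longer
-- conflicts with the constraint.
```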