Tags: postgresql, database-performance, upsert

Update and Insert vs Upsert in PostgreSQL


Say I am building a camera app. Every time the user clicks a photo, the image is stored in the cloud. Because I want to restrict how many images are stored in the cloud, the app gets 10 URLs in an array called listURLs when it is initialised.

The first 10 clicks get PUT into the cloud, exhausting listURLs. Then, every time a click happens, a coin toss determines whether the latest click replaces an existing click in the cloud. Typical numbers would be: 50 clicks in a session, the first 10 clicks each get assigned a URL, and of the remaining 40 clicks, 20 overwrite an existing URL.

I store records of each app session in a Postgres DB. Each session will have an ID and all instances of clicks (which may or may not have a corresponding url). I also need to know the url corresponding to each click, if one exists. So, if there are 30 clicks, I will need to know which 10 of these have a corresponding url.

I can think of two ways of storing this data.

  1. A single table tblClicksURLs with click_id, url and url_active as its fields. Every time a click_id and a non-null url need to be inserted, update all other records with the same url to have url_active set to false.

  2. Two tables, tblClicks and tblURLs, where tblURLs has a click_id foreign key. Every time a click_id and a non-null url need to be inserted, the click_id gets inserted into tblClicks and (click_id, url) gets upserted into tblURLs. The upsert is based on whether the url already exists in tblURLs, so for a given url there will only ever be one click_id in tblURLs. (A rough sketch of both schemas follows this list.)
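
Roughly, I am picturing schemas along these lines (column types and the exact constraints are just assumptions for illustration):

    -- Case 1: one table, one row per click; url_active marks the row currently holding the url
    CREATE TABLE tblClicksURLs (
        click_id    bigint PRIMARY KEY,
        url         text,
        url_active  boolean
    );

    -- Case 2: clicks and urls separated; at most one row per url, pointing at its current click
    CREATE TABLE tblClicks (
        click_id bigint PRIMARY KEY
    );

    CREATE TABLE tblURLs (
        url      text UNIQUE,  -- the upsert's ON CONFLICT target needs a unique constraint or index
        click_id bigint REFERENCES tblClicks (click_id)
    );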

So, in Case 1, I will have an UPDATE of url_active followed by an INSERT on the same table. In Case 2, I will have an INSERT into one table and an UPSERT into another. I will need an index on click_id, but not on url.
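
For concreteness, the per-click write in each case would look roughly like this (the placeholders $1 = url and $2 = click_id, and the exact statements, are only a sketch):

    -- Case 1: mark every existing record with this url as inactive, then insert the new click
    UPDATE tblClicksURLs SET url_active = false WHERE url = $1;

    INSERT INTO tblClicksURLs (click_id, url, url_active)
    VALUES ($2, $1, true);

    -- Case 2: insert the click, then upsert the url so it points at the latest click
    INSERT INTO tblClicks (click_id) VALUES ($2);

    INSERT INTO tblURLs (url, click_id)
    VALUES ($1, $2)
    ON CONFLICT (url) DO UPDATE SET click_id = EXCLUDED.click_id;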

If you are looking at writes of more than 10k rows per second, maybe even more, which of these two would be more efficient? Assume that the numbers per session are similar to those quoted above (50 clicks, etc.).

I could also register a created_at datetime for each record in Case 1, and just use the first non-null url ordered reverse-chronologically. But, I am trying to avoid this, unless the performance benefits are enormous.
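
The read for that alternative would presumably be something along the lines of a DISTINCT ON query (again, only a sketch, assuming the Case 1 table gains a created_at column):

    -- For each url, keep only the most recently created record that references it
    SELECT DISTINCT ON (url) click_id, url, created_at
    FROM tblClicksURLs
    WHERE url IS NOT NULL
    ORDER BY url, created_at DESC;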


Solution

  • After thinking about it for a while, I decided to use UPSERT along with a unique constraint on tblClicksURLs. The constraint is on (url, url_active). Every time a new (click_id, url, url_active) record needs to be added, it is upserted. On conflict, url_active is set to null, so no existing record with this url is left marked as active.

    I then use

    RETURNING (xmax = 0) AS inserted
    

    as discussed here to check whether the record was inserted or updated. If it was updated (i.e. the url already existed), I update the url_active field of the latest record to true.

    In this way, if a url is being inserted for the first time, it happens in a single transaction. If the url already exists, it happens via two transactions, as sketched below.
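
    A sketch of the whole flow (here $1 is the click_id and $2 is the url; the constraint name is arbitrary, and the second statement on the conflict path is my reading of "update the url_active field of the latest record to true"):

        -- Unique constraint backing the ON CONFLICT target; rows whose url_active is NULL
        -- never conflict with anything, so deactivated records for a url can accumulate
        ALTER TABLE tblClicksURLs
            ADD CONSTRAINT tblclicksurls_url_active_key UNIQUE (url, url_active);

        -- One statement per new click: try to insert it as the active record for the url;
        -- if an active record for this url already exists, null out that record's url_active
        INSERT INTO tblClicksURLs (click_id, url, url_active)
        VALUES ($1, $2, true)
        ON CONFLICT (url, url_active)
        DO UPDATE SET url_active = NULL
        RETURNING (xmax = 0) AS inserted;

        -- If inserted came back false, the old record was deactivated but the new click was
        -- not written, so a second statement adds it as the active record for this url
        INSERT INTO tblClicksURLs (click_id, url, url_active)
        VALUES ($1, $2, true);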