Search code examples
sqljsonpostgresqlquery-optimization

How many rows of data are deemed untenable and require a switch to a JSON column


Context

To describe the situation I find myself in currently, let's take the 3 tables relevant to this:

CREATE TABLE steam_user (
    id SERIAL PRIMARY KEY,
    username TEXT NOT NULL,
    steam_id TEXT NOT NULL,
    profile_url TEXT NOT NULL,
    avatar_url TEXT NOT NULL
);

CREATE TABLE achievement (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    display_name TEXT NOT NULL,
    description TEXT NOT NULL,
    icon_url TEXT DEFAULT '' NOT NULL,
    categories SMALLINT[] DEFAULT '{}' NOT NULL
);

CREATE TABLE achievements_on_steam_users (
    id SERIAL PRIMARY KEY,
    steam_user_id INTEGER NOT NULL,  /* foreign key */
    achievement_id INTEGER NOT NULL, /* foreign key */
    achieved BOOLEAN NOT NULL DEFAULT FALSE,
    unlock_time TIMESTAMP NOT NULL DEFAULT '1970-01-01 00:00:00'
);

I'm pulling achievements for a specific set of games from the Steam API and then connecting them together in the achievements_on_steam_users junction table.

The problem

achievements_on_steam_users gains over 1000 rows every time i pull achievement data for a user. This will happen once every time a new user's data is queried on the site. I pull the state of all achievements from steam and then save it to my database.

Use Case

One of the pages required me to query all of these entries for one specific user. Optimally I'd prefer to be able to JOIN this query with the achievement table. Furthermore I would like to support a case, where users can reset their achievement data (on Steam). This is why I would prefer to have rows with the achieved = FALSE state in them and just update as I go, instead of only adding completed achievements to this list, that I'd then have to delete.

These are the two main queries that I'm worrying about:

SELECT *
FROM achievements_on_steam_users
WHERE steam_user_id = $1 /* some id */
QUERY PLAN
Seq Scan on achievements_on_steam_users (cost=0.00..25.60 rows=1328 width=21)
Filter: (steam_user_id = 1)
SELECT *
FROM achievements_on_steam_users
INNER JOIN achievement ON achievements_on_steam_users.achievement_id = achievement.id
WHERE steam_id = $1 
QUERY PLAN
Hash Right Join (cost=42.20..107.98 rows=1328 width=282)
Hash Cond: (achievement.id = achievements_on_steam_users.achievement_id)
-> Seq Scan on achievement (cost=0.00..62.28 rows=1328 width=261)
-> Hash (cost=25.60..25.60 rows=1328 width=21)
-> Seq Scan on achievements_on_steam_users (cost=0.00..25.60 rows=1328 width=21)
Filter: (steam_user_id = 1)

Keep in mind that these times aren't indicative of a lot, since I currently only have test data (my own) in the database.

Question

Would it make sense to instead store these values in a json column directly inside of the steam_user table. I would define a schema similar to:

{
  //This could be any ID from the `achievement` table
  "333": {
    "achieved": true,
    "unlockTime": "1970-01-01 00:00:00"
  }
}

For those unfamiliar with steam achievements; the data inside of the achievement table changes very rarely and even then the IDs would never change. At best new IDs would get added.

Is that a good idea? Maybe someone more familiar with this volume of data can tell me if the queries would considerably slow down in this case?

For now all of my mitigations have been on the code side and I've worked around this, as of now, non-issue. However I'd like to mitigate the amount of rows before this actually gets released publicly, since I feel like the verbosity is unnecessary.


Solution

  • Would it make sense to instead store these values in a json column directly inside of the steam_user table.

    Very unlikely, unless your data is very dynamically defined and is rarely queried in its parts, only as a whole.

    The JSON schema would have a far larger storage footprint. Your current schema incidentally can be improved by removing the unnecessary id column, and making the two FKs into a composite primary key.

    CREATE TABLE achievements_on_steam_users (
        steam_user_id INTEGER NOT NULL,  /* foreign key */
        achievement_id INTEGER NOT NULL, /* foreign key */
        achieved BOOLEAN NOT NULL DEFAULT FALSE,
        unlock_time TIMESTAMP NOT NULL DEFAULT '1970-01-01 00:00:00',
        PRIMARY KEY (steam_user_id, achievement_id)
    );
    

    Your current schema would take per row, and after calculating padding, 24 header bytes plus 24 bytes of data, for a total of 48 bytes per row, or 48000 bytes per user. Incidentally, Postgres is particularly bad for small rows, some DBMSs are much more efficient.

    The JSON schema on the other hand, requires much more. It's hard to tell without actual data, but a quick check using a fiddle gives about 70kb for a single user. On top of that, it's much less efficient to query a single value.

    So it only really makes sense to use this type of JSON structure when the whole blob is being queried always (or almost always), not when you need individual values. Even then, it's still against first principles of database design, because referential integrity can't be enforced, and would only be advised if you already have JSON or if performance demanded it for some reason.


    What I don't really understand is why you need all those rows. It sounds as if you only actually need to know when that achievement_id was attained by the user, so you only need to store rows where achieved is true. Then you can have the schema:

    CREATE TABLE achievements_on_steam_users (
        steam_user_id INTEGER NOT NULL,  /* foreign key */
        achievement_id INTEGER NOT NULL, /* foreign key */
        unlock_time TIMESTAMP NOT NULL DEFAULT '1970-01-01 00:00:00',
        PRIMARY KEY (steam_user_id, achievement_id)
    );
    

    Or for a JSON schema you can just use:

    {
      //This could be any ID from the `achievement` table
      "333": "1970-01-01 00:00:00"
    }