Tags: mysql, mariadb, architecture, many-to-many

Alternative to many-to-many (junction) table to reduce size


My project https://simbiat.ru/fftracker has a junction table that links character IDs to the IDs of the achievements each character has earned. It works fine, but my concern is its size: before optimization it was 6.1 GB, and after optimization 4 GB. And it's not just the size of the index:

Data    3.1 GiB
Index   901.5   MiB
Overhead    2.5 MiB
Effective   4.0 GiB
Total   4.0 GiB

I do not see any performance impact from it now, but I am concerned about the size, especially since the table is updated frequently and keeps growing, even though it has just 3 columns (definition generated by phpMyAdmin):

CREATE TABLE `ff__character_achievement` (
  `characterid` int(10) UNSIGNED NOT NULL COMMENT 'Character ID taken from Lodestone URL (https://eu.finalfantasyxiv.com/lodestone/character/characterid/)',
  `achievementid` smallint(5) UNSIGNED NOT NULL COMMENT 'Achievement ID taken from Lodestone (https://eu.finalfantasyxiv.com/lodestone/character/characterid/achievement/detail/achievementid/)',
  `time` date NOT NULL COMMENT 'Date when achievement was received according to Lodestone'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED;


ALTER TABLE `ff__character_achievement`
  ADD PRIMARY KEY (`characterid`,`achievementid`) USING BTREE,
  ADD KEY `ach` (`achievementid`) USING BTREE;


ALTER TABLE `ff__character_achievement`
  ADD CONSTRAINT `char_ach_ach` FOREIGN KEY (`achievementid`) REFERENCES `ff__achievement` (`achievementid`) ON DELETE CASCADE ON UPDATE CASCADE,
  ADD CONSTRAINT `char_ach_char` FOREIGN KEY (`characterid`) REFERENCES `ff__character` (`characterid`) ON DELETE CASCADE ON UPDATE CASCADE;
COMMIT;

Considering that the original tables are 408 KB and 707 MB and are filled with lots of other data (VARCHARs and even TEXT), I would expect the junction table to be 2 GB with indexes at most. Yes, I can remove the date column, but it does not help much: 2.8 GB instead of 3.1 GB. That seems weird to me, but those are the values reported by MariaDB.
The purpose of the table is to dynamically determine the rarest achievements, as seen on https://simbiat.ru/fftracker/statistics/achievements/, and also to generate a random list of characters that have a given achievement, as seen on https://simbiat.ru/fftracker/achievement/1/ (as an example). For this purpose, 4 GB for a table seems excessive to me.
So my question is: is there some way to store the same data and use it just as easily for the same purpose, but with a smaller footprint?
From what I know, it may be possible to store the achievement IDs for each character as a JSON string, but then I would definitely lose the constraints (which I prefer to retain), and I do not see an easy way to search through the respective JSON values other than a WHERE or FULLTEXT search, neither of which seems appropriate here.
Any suggestion will be appreciated.


Solution

  • Analysis of Size

    The 3 columns take this much per row in the data BTree, which is clustered on the PRIMARY KEY:

    • 4 bytes for INT
    • 2 bytes for SMALLINT
    • 3 bytes for DATE
    • Total, including overhead, is about 40 bytes.

    Since the lone secondary index contains two of those 3 columns, I would not expect it to be much smaller, so I am surprised by the 3.1 : 0.9 GiB ratio.

    3.1 GiB / 40 B ≈ 80M -- is that about how many rows are in the table? If so, my 40 B guesstimate is a bit high, while Bill's was a bit low; still, we are all "in the same ballpark" for this kind of computation. Caveat: the reported 142M row count is itself an approximation; the real value could easily be Bill's 270M or my 80M. The exact count is available only from COUNT(*), unless you are running a version of MariaDB that maintains the exact number.
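To see which estimate is closest, you can compare the engine's approximate row count with an exact count. A sketch; note that COUNT(*) on a table of this size performs a full index scan and may take a while:

```sql
-- Approximate row count and on-disk sizes, as estimated by InnoDB:
SHOW TABLE STATUS LIKE 'ff__character_achievement';

-- Exact row count (full index scan; can be slow at ~100M rows):
SELECT COUNT(*) FROM ff__character_achievement;
```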

    The FKs may be forcing an extra index on the referenced tables, but they do not impact this table (except for slowing down INSERTs).

    Possible workaround

    How many different "achievements" are there? If fewer than 64, we can discuss using SET or BIGINT UNSIGNED as a bitmask. Such a column could be added to the player table and would consume only 8 (or fewer) bytes per player. Since I don't know what you are doing with the data, I can't be more specific. See also 1 << n, &, and BIT_COUNT().
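A minimal sketch of the bitmask idea, assuming at most 64 achievement IDs (numbered 0..63); the `achievements` column is hypothetical:

```sql
-- Hypothetical bitmask column on the existing character table;
-- bit n is set when the character holds achievement n (n = 0..63).
ALTER TABLE ff__character
  ADD COLUMN achievements BIGINT UNSIGNED NOT NULL DEFAULT 0;

-- Mark achievement 17 as earned by character 123:
UPDATE ff__character
   SET achievements = achievements | (1 << 17)
 WHERE characterid = 123;

-- All characters holding achievement 17:
SELECT characterid
  FROM ff__character
 WHERE achievements & (1 << 17);

-- How many achievements character 123 holds:
SELECT BIT_COUNT(achievements)
  FROM ff__character
 WHERE characterid = 123;
```

Note that the bitwise tests cannot use an index, so finding holders of one achievement still scans every character row; the win here is space, not lookup speed. This layout also drops the per-achievement date.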

    Please show us the SELECT that needs optimizing.

    Another approach

    Invent 3000 distinct 3-letter words, one per achievement. Have a TEXT column for each of the 142M rows containing the relevant "words", and add FULLTEXT(col) to build a fulltext index over those "words". Now searching 142M rows for one or a small number of achievement "words" would be very efficient.

    The space for the column would be 4 bytes per id per row (plus some overhead).
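A sketch of the fulltext approach, assuming a hypothetical mapping from achievement IDs to 3-letter tokens (e.g. 1 → 'aab', 2 → 'aac') and a hypothetical `achv_words` column. One caveat: InnoDB only indexes tokens of at least innodb_ft_min_token_size characters (default 3), so 3-letter words just qualify.

```sql
-- Hypothetical column holding one 3-letter token per earned achievement,
-- e.g. 'aab aac bqz' for a character with achievements 1, 2, and 1234.
ALTER TABLE ff__character
  ADD COLUMN achv_words TEXT NOT NULL,
  ADD FULLTEXT INDEX ft_achv (achv_words);

-- Characters holding the achievement encoded as 'bqz':
SELECT characterid
  FROM ff__character
 WHERE MATCH(achv_words) AGAINST('+bqz' IN BOOLEAN MODE);

-- Characters holding BOTH 'aab' AND 'bqz':
SELECT characterid
  FROM ff__character
 WHERE MATCH(achv_words) AGAINST('+aab +bqz' IN BOOLEAN MODE);
```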