Search code examples
mysqlmariadbentity-attribute-value

Does an Entity–Attribute–Value table (EAV table) need a separate ID field as primary key or composite primary key (like entityId+attributeId)?


Here's my EAV table structure (I know EAV is bad but number of attributes I need to store are in over ten thousand, so a normalized table isn't going to work)

Table name - propertyAssign

entityId - int - indexed
attributeId - smallint - composite index with valueId
valueId - smallint - composite index with attributeId

I only need to query this table in two ways.

  1. select attributeId, valudId from propertyAssign where entityId=x
  2. select entityId from propertyAssign where attributeId=x and valueId=y

So I've added indexes accordingly.

Question => Do I need to add any primary key in here?


Solution

  • Renzo's answer covers Select #1, but not

    select entityId from propertyAssign where attributeId=x and valueId=y
    

    That needs

    INDEX(attributeId, valueId, entityId)
    

    It will be

    • Efficient because it perfectly handles the WHERE clause, and
    • Even more efficient because the INDEX contains all the fields needed ("Covering index").

    Yes, that essentially doubles the size of the table (data+PK, then index that contains all the data). But it is much better than having to do a table scan for Select #2.

    It smells like attributeId and valueId are links to 'normalization' tables that have the actual string and value?? Where's the JOIN needed to complete the code? If you are doing it in a separate SELECT, then that is less efficient than a JOIN because it is two (or three?) roundtrips to the server.

    EAV is a really bad design pattern; good luck.

    Edit

    The two SELECTs mentioned will benefit from these two indexes:

     INDEX(entityId, attributeId, valueId) -- for Select #1
     INDEX(attributeId, valueId, entityId) -- for Select #2
    

    And, since that triple is UNIQUE, one or the other INDEX may as well be the PRIMARY KEY. Now, to pick which...

    When INSERTing, having the PK start with entityId makes it so that all the triples for an entity are 'clustered' together. This speeds up both the INSERT and SELECT #1. So I vote for it being the PK. Making the other one the PK will not speed up the INSERTs. This is because the creation of an entity with lots of attributes will lead to lots of scattered writes.

    Each of the two SELECTs is optimally handled by one or the other index; so the SELECTs are as fast as possible. Well, I am ignoring that fact that you normalized the attribute names and values. This my bite you later, and make for uglier queries.

    I say it is a bad design partially because of a benchmark of a very similar schema. The stress test populated the tables with more than could be cached. The insert rate could not go beyond 7 entities per second. And that was with the RAID-striped disks running at full capacity. The normalized attributes, etc, were leading to lots of random disk hits.