Here's my EAV table structure (I know EAV is bad but number of attributes I need to store are in over ten thousand, so a normalized table isn't going to work)
Table name - propertyAssign
entityId - int - indexed
attributeId - smallint - composite index with valueId
valueId - smallint - composite index with attributeId
I only need to query this table in two ways.
So I've added indexes accordingly.
Question => Do I need to add any primary key in here?
Renzo's answer covers Select #1, but not
select entityId from propertyAssign where attributeId=x and valueId=y
That needs
INDEX(attributeId, valueId, entityId)
It will be
WHERE
clause, andINDEX
contains all the fields needed ("Covering index").Yes, that essentially doubles the size of the table (data+PK, then index that contains all the data). But it is much better than having to do a table scan for Select #2.
It smells like attributeId
and valueId
are links to 'normalization' tables that have the actual string and value?? Where's the JOIN
needed to complete the code? If you are doing it in a separate SELECT
, then that is less efficient than a JOIN
because it is two (or three?) roundtrips to the server.
EAV is a really bad design pattern; good luck.
Edit
The two SELECTs
mentioned will benefit from these two indexes:
INDEX(entityId, attributeId, valueId) -- for Select #1
INDEX(attributeId, valueId, entityId) -- for Select #2
And, since that triple is UNIQUE, one or the other INDEX
may as well be the PRIMARY KEY
. Now, to pick which...
When INSERTing
, having the PK start with entityId
makes it so that all the triples for an entity are 'clustered' together. This speeds up both the INSERT
and SELECT
#1. So I vote for it being the PK. Making the other one the PK will not speed up the INSERTs
. This is because the creation of an entity with lots of attributes will lead to lots of scattered writes.
Each of the two SELECTs
is optimally handled by one or the other index; so the SELECTs
are as fast as possible. Well, I am ignoring that fact that you normalized the attribute names and values. This my bite you later, and make for uglier queries.
I say it is a bad design partially because of a benchmark of a very similar schema. The stress test populated the tables with more than could be cached. The insert rate could not go beyond 7 entities per second. And that was with the RAID-striped disks running at full capacity. The normalized attributes, etc, were leading to lots of random disk hits.