Search code examples
mysqlperformancejoinindexingentity-attribute-value

MySQL self join performance: fact or just bad indexing?


As an example: I'm having a database to detect visitor (bots, etc) and since not every visitor have the same amount of 'credential' I made a 'dynamic' table like so: see fiddle: http://sqlfiddle.com/#!9/ca4c8/1 (simplified version).

This returns me the profile ID that I use to gather info about each profile (in another DB). Depending on the profile type I query the table with different nameclause (name='something') (ei: hostname, ipAddr, userAgent, HumanId, etc).

I'm not an expert in SQL but I'm familiar with indexes, constraints, primary, unique, foreign key etc. And from what I saw from these search results:

Most of them have comments about bad performance on self-join but answers tend to go for the missing index cause.

So the final question is: is self joining a table makes it more prone to bad performance assuming that everything is indexed properly?


On a side note, more information about the table: might be irrelevant to the question but is well in context for my particular situation:

  • column flag is used to mark records for deletion as the user I use from php don't have DELETE permission over this database. Sorry, Security is more important than performance
  • I added the 'type' that will go with info I get from the user agent. (ie: if anything is (at least seems to be) a bot, we will only search for type 5000.
  • Column 'name' is unfortunately a varchar indexed in the primary key (with profile and type).
  • I tried to use as much INT and filtering (WHERE) in the SELECT query to reduce eventual lost of performance (if that even matters)
  • I'm willing to study and tweak the thing if needed unless someone with a high background in mysql tells me it's really not a good thing to do.

This is a big project I have in development so I cannot test it with millions of records for now but I wonder if performance will be an issues as this grows. Any input, links, references, documentation or test procedure (maybe in comments) will be appreciated.


Solution

  • A self-join is no different than joining two different tables. The optimizer will pick one 'table', usually based on the WHERE, then do a Nested Loop Join into the other. In your case, you have implied, via LEFT, that it should work only one way. (The Optimizer will ignore that if it sees no need for it.

    Your keys are find for that Fiddle.

    The real problem is "Entity-Attribute-Value", which is a messy way to lay out data in tables. Your query seems to be saying "find a (limit 1) profile (entity) that has a certain pair of attributes (name = Googlebot AND addr = ...).

    It would be so much easier, and faster, to have two columns (name and addr) and a "composite" INDEX(name, addr).

    I recommend doing that for the common "attributes", then put the rest into a single column with a JSON string. See here.