Search code examples
mysqldatabase-designdatabase-performance

SQL performance for string field vs multiple int/varchar fields


I'm trying to get the database design right, and I'm uncertain about which of 2 options I should choose. There will be around 20 bool values that are used to filter (4 in the sample for simplicity).

  • prop1
  • prop2
  • prop3
  • prop4

Option 1)

A int(1) or varchar(1) field for each prop. The filter part of the query could be like

WHERE prop1=1 AND prop3=1 AND prop4=1

Option 2)

Using a single text field with characters representing the props

  • a: prop1
  • b: prop2
  • c: prop3
  • d: prop4

Then the filter part of the query similar to option 1 would be like

WHERE props LIKE '%a%' AND props LIKE '%c%' AND props LIKE '%d%'

or if the characters are sorted:

WHERE props LIKE '%a%' AND props LIKE '%cd%'

My thought is that option 2 makes it easier to add new props so I like this option, but will the LIKE comparator perform worse than the equality comparator? Does it make any difference to use a single text compared to multiple int(1) or varchar(1)? Any other benefits or drawback that I haven't thought of?


Solution

  • The primary question is whether you can run faster than scanning the entire table. The answer is "no" unless a small number of the booleans can be handled separately with Index(es).

    Your WHERE bools LIKE '%a%c%d%' is a clever trick for ANDing any number of flags together. However, it will need to look at every row, and LIKE is slightly heavyweight.

    INT(1) takes 4 bytes plus overhead. TINYINT is what you are fishing for; it takes 1 byte, plus overhead.

    A SET with up to 64 bools is another technique. The coding is a bit clumsy, but it is rather efficient

    INT UNSIGNED (for up to 32) or BIGINT UNSIGNED (for up to 64) flags is implemented similarly to SET and also takes up to 8 bytes. But the coding is rather clumsy. Let's number the bits starting with 0 in the least significant bit.

    WHERE (bools & ( (1 << 0) | (1 << 2) | (1 << 3) ) ) = 
                   ( (1 << 0) | (1 << 2) | (1 << 3) )
    

    would check that bits 0, 2, and 3 are all set. (This is like your test for a,c,d.) A variety of ANDs and ORs are possible with this approach. (You could pre-compute those bit values-- 13 in this example. Or use a bit literal: 0b1101.)

    The benefit of SET or bits in an INT is the 'speed' within each row. Still, all rows must be tested.

    So, I recommend triaging your bools, etc, and decide what needs indexing and what can go into this combined column or in a combined JSON column for non-bools.