Performance, space and other observations when using VarChar or Char instead of Integer in Firebird


I frequently use columns of type Integer to specify the "type" of a record, so 0 = TypeA, 1 = TypeB and so on. I can put a translation into the column description, but the value itself is not really human-readable. Such a column may also be indexed.
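For illustration, a minimal sketch of such a column in Firebird SQL (the table and column names here are hypothetical):

    -- Hypothetical "type" column: the meaning lives only in the comment
    CREATE TABLE records (
        id       INTEGER NOT NULL PRIMARY KEY,
        rec_type INTEGER NOT NULL  -- 0 = TypeA, 1 = TypeB, ...
    );
    CREATE INDEX ix_records_type ON records (rec_type);
    COMMENT ON COLUMN records.rec_type IS '0 = TypeA, 1 = TypeB';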

Another case: sometimes records have a short, unique alphanumeric string as their identifier. On top of that I add an Integer ID as the primary key (with only a unique index on the name), because I was told primary keys should never be strings for performance reasons. Once I have looked up the ID for a name, I use the ID wherever possible.
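A sketch of that pattern, again with hypothetical names (surrogate Integer primary key plus a unique index on the short code):

    -- Hypothetical: surrogate key plus unique natural code
    CREATE TABLE item_kind (
        id   INTEGER NOT NULL PRIMARY KEY,
        code CHAR(8) CHARACTER SET ASCII NOT NULL UNIQUE
    );

    -- Resolve the code to its ID once, then use the ID elsewhere
    SELECT id FROM item_kind WHERE code = 'WIDGET';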

So the question is: what performance, space and other observations can you share about using strings directly where I could use Integers? These strings are short, 8 characters or fewer, and 7-bit ASCII would be enough.

This question is mainly about Firebird 3.0, but I am also very interested in a general rule of thumb, and in how Firebird compares to other databases such as MSSQL.

This question is not about using natural rather than surrogate primary keys.


Solution

There are no solid rules; it will always be a trade-off. Don't generalize too much: consider the use case, ease of use, and the potential overhead.

Integral numeric values are usually preferred because they lead to smaller indexes. Numeric index keys in Firebird (except for 64-bit integers) are stored as 64-bit double precision values, which means that, ignoring optimizations like prefix and suffix compression, an index entry for a number takes 8 bytes.

For character values, the worst case is the maximum declared length in characters multiplied by the maximum bytes per character of the character set: a CHAR(8) CHARACTER SET UTF8 key can consume 8 x 4 = 32 bytes in the index, while a CHAR(8) CHARACTER SET WIN1252 key consumes 8 x 1 = 8 bytes (again ignoring the same optimizations).
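As a sketch, the same declared length can therefore produce very different worst-case index keys depending only on the character set (hypothetical table names):

    -- Worst-case index entry sizes, ignoring compression:
    CREATE TABLE t_utf8 (code CHAR(8) CHARACTER SET UTF8);     -- 8 x 4 = 32 bytes
    CREATE TABLE t_1252 (code CHAR(8) CHARACTER SET WIN1252);  -- 8 x 1 = 8 bytes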

In other words, if your strings are 8 characters or fewer in a single-byte character set (note that even 7-bit character sets like ASCII still use a full byte per character), there will be no significant difference between using an integer and using the string value.

In addition, if the primary key is targeted by a foreign key from another table and you use a surrogate, you now need an extra join just to look up the code, while the code would be immediately available if you had used it directly as the primary key. The cost of those extra joins can outweigh any expected performance 'benefits' of the surrogate.
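To make that concrete, hypothetical queries against the tables sketched above (an assumed referencing table called orders): with a surrogate key the code must be joined in, while with a natural key it is already present in the referencing row.

    -- Surrogate key: an extra join just to see the code
    SELECT o.id, k.code
    FROM orders o
    JOIN item_kind k ON k.id = o.kind_id;

    -- Natural key: the code is stored in the referencing table itself
    SELECT o.id, o.kind_code
    FROM orders o;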

As with everything in software development, it is a trade-off. If the string code stays small, around 8 or a few more characters, and you know it won't need to change (cascading updates, etc.), then by all means: choose the string code.

On the other hand, if the string code will need to change regularly (which leads to cascading updates), use a surrogate. If the string code grows beyond, say, 16 bytes, consider implementing both and running realistic(!) performance tests. If the string code is very long (say 50 bytes or more), lean towards a surrogate.
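For completeness, a sketch of what "cascading updates" means here, using the same hypothetical tables: if the changeable natural code is the key, the foreign key has to propagate every change to all referencing rows.

    -- Every UPDATE of item_kind.code rewrites all matching rows in orders
    ALTER TABLE orders ADD CONSTRAINT fk_orders_kind
        FOREIGN KEY (kind_code) REFERENCES item_kind (code)
        ON UPDATE CASCADE;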