Search code examples
mysqlimagehashmicrotime

Which is better Storing hash value or the bigint variable from which the hash value is generated


I have a table in which a column stores image src which is in hash value and that hash value is generated from microtime(),Now I have two choice storing directly hash value in database or storing that bigint microtime from which the image name is derived.Which would make my db faster.


Solution

  • We have to analyze this from all sides to asses what speed faults are incured.

    I will make a few assumptions:

    • this data will be used as an identifier (primary key, unique key, composite key);
    • this data is used for searches and joins;
    • you are using a hashing algorithm such as SHA1 that yields a 40 character string of hex encoded data (MD5 yields a 32 character string of hex encoded data all said bellow can be adapted to MD5 if that's what you're using);
    • you may be interested in converting the hex values of the hash into binary to reduce the storage required by half and to improve comparison speed;

    Inserting and Updating on the application side:

    As @Namphibian stated is composed of 2 operations for the BIGINT versus 3 operations for the CHAR.

    But the speed difference in my opinion really isn't that big. You can run 10.000.000 continuous calculations (in a while loop) and benchmark them to find out the real difference between them.

    Also a speed difference in the application code affects users linearly, while speed differences in the DB affect users nonlinearly when traffic increases because overlapping writes have to wait for each other and some reads have to wait for writes to finish.

    Inserting and Updating on the DB side:

    Is almost the same for a BIGINT as it is for a CHAR(40) or a BINARY(20) because the more serious time consumption is done waiting for access to the disk rather than actually writing to it.

    Selecting and Joining on the DB side:

    This is always faster for a BIGINT compared to a CHAR(40) or a BINARY(20) for two reasons:

    • BIGINT is stored in 8 bytes while CHAR(40) is stored in 40 bytes and BINARY(20) in 20 bytes;
    • BIGINT's serially increasing nature makes it predictable and easy to compare and sort.

    Second best option is the BINARY(20) because it saves some space and it is easier to compare due to reduced length.

    Both BINARY(20) and CHAR(40) are the result of the hashing mechanism and are randomized, hence comparing and sorting takes a longer time on average because randomized data in indexes (for a btree index) needs more tree traversals to fetch (i mean that in the context of multiple values, not for one single value).