I have a table I want to store filepaths in - so I have a varchar field, with a size of 4096 (the default MAX_PATH size in linux). However, I need to be able to do queries of all filepaths within a certain directory, so I was thinking of doing a query like:
SELECT *
FROM files_table
WHERE files_table.path LIKE "/my/awesome/dir/%"
When I run this on my database, with the path field UNINDEXED, it takes about 10s. Ok, I can see it taking a while, considering my table size is about 4 million, and it's an unindexed field. However, when I index it, with an index size of 500, the query time jumps... up to about 30s!
This seems very confusing to me. Does anybody have any ideas on what might be causing this?
For those hungry for more data:
As a bit of extra data - I tried running an "explain" on the query, and found that it IS, indeed, using my index but it reports the key_len as only 5! This seems strange, as well.
Also - while I would like to hear a good answer to my question (because I want to understand what's going on here!), I'm also open to ideas of the, "I dunno why it's doing that, but it doesn't matter, because you really should be designing your database like this..." sort. For those who lean that way, what I'm really trying to do is build a database structure to do queries of various (cached) data from a large networked filesystem. I know that just storing a filepath is probably the most naive way of approaching this, but I figured I'd try it out as a first-pass implementation, and see where it got me.
Edit:
So, a bit more digging / information: the actual index is a multiple-column index - the first index is an int, holding a batch_id (ie, the table holds cached information about the filesystem, so each snapshot gets it's own batch_id), and the second is my partial index for the path varchar. So when EXPLAIN says the index key_len, the first 4 bytes of that are actually for the batch_id - meaning it only has a one-byte index for the path!
Oh, and the "actual" query does restrict on batch_id too, so it looks more like this:
SELECT *
FROM files_table
WHERE batch_id=5
AND files_table.path LIKE "_globalSoft/my/awesome/dir/%"
Secondly - a large-ish percentage of the files in my database have a path that starts with "_" - "_globalSoft" in the query above being an example of one. (Yes, the paths are all relative.) So, if the key_len is only 5, it's possible that the only character being used in the key is the leading "_" - which would explain why it's so slow.
Of course, this still begs the question of why it's only using the leading "_". When reading the docs for MySQL indexes (http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html), I noticed this line:
Strings are automatically prefix- and end-space compressed. See Section 13.1.8, “CREATE INDEX Syntax”.
Unfortunately, the given link doesn't say anything about string prefix-compression, and I'm having a hard time finding much information about it. What information I have found is all about MyISAM, and I'm using InnoDB right now. (Though switching to MyISAM might make sense, since it's supposedly better with strings.)
FACEPALM
Ok, I'm an idiot... the problem was that I was matching directories like "globalSoft" - ie, directories that start with an underscore - and didn't realize that "" was a special character (like %), and didn't escape it.
Forgive my stupidity!