Can I "optimize" a query involving a text column?

If I have this 400M rows table:

create table my_table
( id int,
  filled tinyint,
  content text not null
) engine=myisam

can I prevent the extra access of content (due to column type text) with:

select id,if(filled,content,'') from my_table

(filled is 0 or 1)?

Or if content can be null, would

select id,ifnull(content,'') from my_table

be better (still in term of performance since the null overhead would then occur)?

Solution

Your question is imprecise. Might a row have filled = 1 and content = 'some text'?

If yes, then the IF(filled, content, '') makes sense.

If no, then get rid of filled and simply set content to ''. The cost of accessing an empty string for content is tiny.

How big is content typically? If is usually only hundreds of characters, the performance difference is small.

Don't use MyISAM, switch to InnoDB.

Do you have a lot of DELETEs or UPDATEs? If so, the .MYD file of MyISAM may be fragmented. This implies that content may be scattered around the disk. If you have no updates or deletes, then content is "right there", hence virtually free to access.

In contrast, InnoDB sometimes puts TEXT and BLOB "off-record".

But, which is better or worse? The off-record (in some situations) versus the fragmentation (in some situations).

Are you fetching just one row? Or lots of such rows? This is another area where MyISAM and InnoDB are implemented quite differently. MyISAM needs to step over content to get to the "next" row. (I won't go into a lengthy discussion of where the "next" row is likely to be.)

Bottom line: Don't worry about optimizing, but do change to InnoDB (for a lot of reasons).

Where is the "next" row likely to be?

MyISAM... Rows are added to the .MYD in the order of inserts. But, after updates or deletes, new/changed rows tend to fill in the gaps first, leaving appending to the file as a last resort.

InnoDB... Rows are in a B+Tree ordered by the PRIMARY KEY. The BTree is blocked with 16KB blocks. You have a TEXT column, so it gets more complex...

For "small" TEXT columns, several rows will fit into 16KB. If, say, content is 1KB strings, up to about 15 rows will be put in a block. If they are much bigger, then content will be stored "off-record".

The rows in a block are essentially all equidistant from each other. (The big effort is loading the block into cache and locating the block.)

The "next" (or "previous") block is just a disk access away -- see the "+" in "B+Tree".

When a block overflows, it is "split" into two blocks, some rows go into each. The new block is allocated somewhere, rarely adjacent.

With SSDs, all blocks are "equidistant" from each other. With HDD, rotational delays and arm motion complicate the timing of which block might be "closer" than some other block.