database data-structures indexing primary-key b-tree

Why is a primary key index an additional structure?

I am reading that RDMS stores table data on disk in some form of B-tree, and also that table indexes are stored in the B-tree form.

I read that primary key index is created automatically for a primary key defined, but that it could also be dropped anytime. So, it implies that primary-key index is an additional structure next to the B-tree used for just storing table data.

Isn't that wasting of resources - why wouldn't all table table be kept through primary-key index?

If it isn't like that, which order is then used for the B-tree used to store table data?

Thanks for clarifying

Solution

The primary key index is an optimization for finding the place on disk where the row is held. As a structure, it contains simply the PK data, not the whole row.

On a database, performance is often gated by how many pages are read from disk vs. cache. Since the PK index is smaller than the whole table, it is more likely to be in cache, it causes fewer blocks to be read from disk, and less blocks of other tables are removed from cache. It therefore is a major performance optimization.

Further, while modifying the table data, rows are locked. If the primary key were being scanned from the table data on disk, locked rows would slow access for all the other queries. By separating the index as a separate structure, the index can be used even while the row being pointed to is locked.

So overall, the separate PK structure is a classic space-for-time optimization.

EDIT What is the order of the rows in the table? The following answer is for Oracle, but is applicable to many databases.

Short answer: rows are not ordered on disk which is why the PK index (and other indexes) are so important.

Long answer:

While the primary-key b-tree structure is necessarily sorted (the b-tree) the rows of the table are scattered across the table-space. To understand this we need to drill down the various data structures.

First, the database is structured into logical entities called a tablespaces. A tablespace is the space in one or more files on one or more disks. The files start empty. When the tablespace become full (technically when the data in it reaches a threshold) the tablespace can be automatically grown. It can also be grown manually by enlarging the file (adding an 'extent', or adding new files). Tablespaces can be clustered across multiple machines as well as disks.

Second: A tablespace is divided segments, each segment for the use of a single table or index.

Third: The segment is divided into blocks, each block has space for one or more rows. These blocks are not the same as disk or OS blocks; Oracle blocks are one or more OS blocks. (This is for transportability, and for managing media with different block sizes).

On insert, a database will select a space in a block from anywhere in the tablespace. The row can be inserted sequentially (especially bulk inserting into an empty table), but normally the database will also reuse space where rows have been deleted or moved due to some types of update. While the placement is theoretically kind-a predictable, in practice you should never rely on or expect the row to be placed in any specific block.

One interesting thing in Oracle is the ROWID. This is the reference stored in the index that allows the DB to look up the row:

An extended rowid has a four-piece format, OOOOOOFFFBBBBBBRRR:
The first 6 characters OOOOOO represent data object number, using 32bits
The next 3 characters FFF represent tablespace-relative datafile number, using 10bits.
The next 6 characters BBBBB represent block number, using 22bits.
The last 3 character RRR represent row number, using 16bit

For much more detail, see http://docs.oracle.com/cd/E11882_01/server.112/e25789/logical.htm#autoId0

One other thought: There is a concept in the DB world called partitions, where a dataset is divided across different tablespaces (frequently different disks or nodes in a cluster) depending on some expression logic. For example, on a table of customers, a vertical partition could be defined by the country of the person. That way you can ensure that the US customers are physically on one disk while the Australians are on another.