How does GreenPlum split data into different segments?

I would like to ask what rules does GreenPlum separate it's tables into segments.

Does it evenly separate data into segments or according to block size?

Or depends on other characteristic of the data.

Thanks!

Solution

Two different ways.

Distribution key

Example:

CREATE TABLE foo
(id int, bar text)
DISTRIBUTED BY (id);

This will spread the data the id column. You should pick a column or set of columns that will spread the data evenly across the database. If the table is very large and you join it to another table that is also very large, you may want to distribute both tables by the same keys.

Random

Example:

CREATE TABLE foo
(id int, bar text)
DISTRIBUTED RANDOMLY;

This distributes the data in a random fashion. Use this for small tables or if there isn't a natural key to the table.

You can also see how the distribution by using the hidden column "gp_segment_id".

select gp_segment_id, count(*) from foo group by gp_segment_id order by gp_segment_id;
 gp_segment_id | count 
---------------+-------
             0 |  1654
             1 |  1655
             2 |  1665
             3 |  1661
             4 |  1682
             5 |  1683