assembly x86-16 emu8086 memory-segmentation

Are segments supposed to overlap? With an empty data section CS and DS get the same value

All my (extremely basic) Assembly programs are written following this schema:

.model small
.stack 0100h
.data
; insert variables here

.code
.startup

; insert instructions here

.exit
END

Today I noticed that the Code segment and the Data segment overlap: when variables are declared in the .data section, the Code segment is "pushed" down in memory to make room for them. If no variable is declared, the DS and CS registers hold the same value. I traced this (odd, to me) behaviour back to the .startup directive, which sets DS to the same value as CS.

This means that when I access, for example, the [SI] location, I'm actually touching the Code segment.

Now I wonder: is this supposed to happen? I remember that the Stack and Data segments might share the same space in memory (usually one at the beginning and one at the end), but I thought that the code and data segments should never interfere with one another...

Can anyone enlighten me on this?

Solution

Yes, that segmentation scheme allows for overlap, and you can use it or not! Many different memory "models" were used.

Let's say, for example, that we have a code segment followed by a data segment. Segments start on 16-byte (paragraph) boundaries, so if the data segment is placed as close as possible after the code, the gap between the end of the code and the start of the data is at most 15 bytes. Averaged over program sizes, we might argue that ~8 bytes are wasted between the code segment and the data segment. It also means that some 16-bit code references (offsets, in x86 segmentation parlance) refer to code while others reach into (some of) the data, though this is very hard to take advantage of.
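To put numbers on that gap, here is a minimal Python sketch (Python only because the arithmetic is easy to run; the segment value 0x0700 and the helper names are made up for illustration). It assumes the data segment is placed on the first 16-byte boundary at or after the end of the code:

```python
PARAGRAPH = 16  # a segment register value addresses memory in 16-byte steps

def next_segment(code_seg, code_size):
    """Lowest segment value whose base lies at or beyond the end of the code."""
    code_end = code_seg * PARAGRAPH + code_size     # first physical byte after the code
    return (code_end + PARAGRAPH - 1) // PARAGRAPH  # round up to a paragraph boundary

def gap_bytes(code_seg, code_size):
    """Unused bytes between the end of the code and the base of the data segment."""
    data_seg = next_segment(code_seg, code_size)
    return data_seg * PARAGRAPH - (code_seg * PARAGRAPH + code_size)

# Hypothetical load segment 0x0700, code sizes from 1 to 256 bytes
gaps = [gap_bytes(0x0700, size) for size in range(1, 257)]
print(max(gaps), sum(gaps) / len(gaps))  # worst-case and mean padding: 15 7.5
```

A size of 0 gives back the same segment value, which matches what the question observes when one of the two segments is empty.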

If you want these two segments not to overlap, you have to round the code segment up to a full 64k. That way, every 16-bit code reference lands either in the actual code or in the padding added to round it up to 64k.

Doing this is more of a waste of memory than some kind of advantage, though.
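As a rough illustration of the cost (again in Python, with a hypothetical load segment and code size), keeping the two 64k windows disjoint means starting the data segment 0x1000 paragraphs after the code segment:

```python
PARAGRAPH = 16          # bytes per segment-register step
SEGMENT_SPAN = 0x10000  # bytes reachable through a 16-bit offset

def disjoint_data_segment(code_seg):
    """First segment value whose window cannot overlap the code segment's window."""
    return code_seg + SEGMENT_SPAN // PARAGRAPH

def padding_bytes(code_size):
    """Bytes left unused when the code is rounded up to a full 64k window."""
    return SEGMENT_SPAN - code_size

code_seg, code_size = 0x0700, 0x23   # hypothetical values, not taken from a real program
print(hex(disjoint_data_segment(code_seg)))  # 0x1700
print(padding_bytes(code_size))              # 65501
```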

Basically, the segmentation scheme lets a 16-bit pointer reach up to 64k from CS and up to 64k from DS, while still allowing only a smallish gap between the code and data segments, which leaves more memory for data. You really don't want gaps in a small address space, as they are hard to take advantage of.
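The underlying rule is that the physical address is the segment value shifted left by 4 bits plus the offset. A small Python sketch of that rule, assuming hypothetical values CS = 0x0700 and DS = 0x0703 (35 bytes of code, data in the next paragraph):

```python
def physical(segment, offset):
    """20-bit physical address for a real-mode segment:offset pair."""
    return ((segment << 4) + offset) & 0xFFFFF  # the 8086 wraps at 1 MB

CS, DS = 0x0700, 0x0703  # hypothetical register values for illustration

# DS:0000 and CS:0030 name the same byte of memory
print(hex(physical(DS, 0x0000)), hex(physical(CS, 0x0030)))  # 0x7030 0x7030

# Both 64k windows overlap almost entirely; DS just reaches 0x30 bytes further
print(hex(physical(CS, 0xFFFF)), hex(physical(DS, 0xFFFF)))  # 0x16fff 0x1702f
```

So with DS close to CS, most offsets through DS still land on bytes that are also reachable through CS, and with DS equal to CS every one of them does.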

It isn't really a security mechanism, though.

At the time, a language like Pascal kept function pointers and data pointers quite separate from a type-system perspective, so this scheme worked well for that class of languages (not to mention assembly language, still very popular back in the day). C made things a bit harder, since void * could potentially be used for code pointers, data pointers, or both, and you cannot tell whether a 16-bit pointer originally referred to code or to data by inspecting its value. That confused things somewhat (perhaps for debuggers), though the programmer would generally know which was which.

The choice of 20 bits in total (resulting in 12 bits of overlap between the shifted segment value and the offset) is somewhat arbitrary, I think, but has trade-offs. Going from 2^16 to 2^20 added 4 address lines to every processor (and perhaps also to motherboards), even when the machine was populated with far less than the 1 MB maximum of memory. And I suppose the designers felt that a 16x larger address space was enough for the time.
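For what it's worth, the bit counts can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming 8086-style wraparound at 1 MB; the address 0x7030 is an arbitrary example and aliases_of is a made-up helper name:

```python
SEG_BITS, OFF_BITS, SHIFT = 16, 16, 4

address_bits = SEG_BITS + SHIFT  # 20 address lines
overlap_bits = OFF_BITS - SHIFT  # 12 bits shared by the shifted segment and the offset

def aliases_of(phys):
    """All segment:offset pairs that reach phys, assuming addresses wrap at 1 MB."""
    pairs = []
    for seg in range(1 << SEG_BITS):
        off = (phys - (seg << SHIFT)) % (1 << address_bits)
        if off < (1 << OFF_BITS):  # the offset must fit in 16 bits
            pairs.append((seg, off))
    return pairs

pairs = aliases_of(0x7030)
print(address_bits, overlap_bits, len(pairs))  # 20 12 4096
```

Those 12 overlapping bits are why each physical byte can be named by 2^12 = 4096 different segment:offset pairs, including one through CS and another through DS.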