Tags: opengl, cuda, gpu, gpgpu

Coalesced memory access performance


I've read about coalesced memory access (In CUDA, what is memory coalescing, and how is it achieved?) and its importance for performance. However, I don't know what a typical GPU does when a non-coalesced memory access occurs. When a thread "asks" for a byte at position P and the other threads ask for something far away, does the GPU fetch a complete 128-byte block for that thread? If the read is aligned, can I read the other 127 bytes "for free"?


Solution

  • General rules:

    • memory access instructions are issued warp-wide, just like any other instruction
    • each thread in a warp provides an address to read from
    • assuming these addresses don't "hit" in any of the caches, the memory controller collects all addresses and determines how many "segments" (roughly analogous to a cacheline) are required from DRAM. A "segment" is either 32 bytes or 128 bytes, depending on cache and device specifics.
    • the memory controller then requests those lines/segments from DRAM

    If a single thread generates an address that is not near any of the other addresses generated in the warp, then the memory controller will need to request a whole line/segment from DRAM, which may be either 32 bytes or 128 bytes, depending on device and which caches are involved (i.e. what type of "miss" occurred) just to satisfy that one address from that one thread. Therefore regardless of whether that thread is requesting a minimum of 1 byte or up to the maximum of 16 bytes possible in a single thread read transaction, the memory controller must read either 32 bytes or 128 bytes from DRAM to satisfy the read originating from that thread. Similar logic will apply to every other address emanating from that particular "warp read".

    This type of scattered or isolated access pattern is "uncoalesced", because no other thread in the warp needs an address close enough to be satisfied from the same segment/line.
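
    As a rough illustration (a minimal sketch; the kernel names and the stride value are arbitrary), the first kernel below produces fully coalesced warp reads because adjacent threads read adjacent ints, while the second forces each thread of a warp into a different segment:

    ```cuda
    // Coalesced: thread i of a warp reads in[i], so the 32 four-byte reads
    // of the warp fall into a single aligned 128-byte region.
    __global__ void coalesced_read(const int *in, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Uncoalesced: a large stride puts every thread's address into a different
    // 32-byte/128-byte segment, so the memory controller must fetch a whole
    // segment per thread even though each thread uses only 4 bytes of it.
    __global__ void strided_read(const int *in, int *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        long idx = (long)i * stride;   // e.g. stride = 32 or larger
        if (idx < n)
            out[i] = in[idx];
    }
    ```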

    When a thread "asks" for a byte at position P and the other threads ask for something far away, does the GPU fetch a complete 128-byte block for that thread?

    Yes, either 32 bytes or 128 bytes is the minimum granularity of request that can be made from DRAM.

    If the read is aligned, can I read the other 127 bytes "for free"?

    Whether you need it or not, and regardless of alignment of requests within the line/segment, you will get either 32 bytes or 128 bytes from any DRAM read transaction.
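
    You can't get those extra bytes "for free" for a single thread, but a warp can consume them: if neighboring threads read neighboring data, or each thread uses a wide aligned load of up to 16 bytes, every byte of the segment that DRAM delivers is actually used. A minimal sketch of the 16-byte-per-thread case (the kernel name and the float4 choice are just one possibility, assuming the input pointer is 16-byte aligned):

    ```cuda
    // Each thread performs one 16-byte load (a float4) from consecutive positions,
    // so a warp reads 32 * 16 = 512 contiguous bytes and nothing fetched from
    // DRAM is wasted. n4 is the number of float4 elements.
    __global__ void vectorized_read(const float4 *in, float *out, int n4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            float4 v = in[i];                   // single 16-byte load per thread
            out[i] = v.x + v.y + v.z + v.w;     // use all 16 bytes
        }
    }
    ```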

    This doesn't cover every case, but a general breakdown of the 32-byte/128-byte difference is as follows:

    1. cc2.x devices have an enabled L1 cache, so a cache "miss" will generally trigger a read of 128 bytes.
    2. cc3.x devices have only the L2 cache enabled (for global memory transactions), and the L2 cacheline size is 32 bytes. A "miss" here will require a 32-byte load from DRAM, but a fully coalesced read across a warp will still ultimately require a load of 128 bytes (for int or float, for example), so ultimately four L2 cacheline "sectors" will still be needed. (There is no free lunch.)
    3. cc5.x devices once again have the L1 enabled, so should be back to needing a full 128-byte load on a "miss".
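
    If you want to know which of these cases applies on a given GPU, you can query the compute capability at runtime; a minimal sketch (error checking omitted, device 0 assumed):

    ```cuda
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        printf("compute capability %d.%d\n", prop.major, prop.minor);
        // per the breakdown above (roughly):
        //   major == 2  -> L1 enabled, 128-byte granularity on a miss
        //   major == 3  -> L2 only for global loads, 32-byte sectors
        //   major >= 5  -> L1 enabled again, back to 128-byte loads on a miss
        return 0;
    }
    ```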

    This presentation will be instructive. In particular, slide 17 shows one example of "perfect" coalescing, whereas slide 25 shows an example of a "fully uncoalesced" load.