The way I'm doing it now: as I understand it, every new image has to be created with LAYOUT_UNDEFINED. I then place a barrier to transition it to TRANSFER_DST_OPTIMAL, and after that I can transfer data into the image. Unless you want to use linear tiling, that is the minimum you need to do.
Now I need to transfer the data to the image. In my case the host-visible memory isn't device-local, so I write the image data to a staging buffer, then issue a buffer-to-image copy and transition the image from TRANSFER_DST_OPTIMAL to SHADER_READ_ONLY_OPTIMAL. As far as I can tell, this is the most efficient way to do it if your host-visible memory isn't device-local.
Now consider the case where host-visible memory is also device-local. I feel that I should be able to write directly into the device-local memory backing the image, but I don't know how, or even whether it's possible, because I think you always need a buffer-to-image copy. Is that right?
As long as you're using optimal tiling, image data needs to go through a Vulkan-controlled process before the GPU can use it. That process is a copy operation. So even if you're loading raw data into device-local memory, you still need to copy that data into the actual image via the usual operations.
Staging is not optional for optimally tiled images, even if the staging buffer itself is in device-local memory.
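To make the ordering concrete, here is a toy model in plain C of the sequence discussed above: write to a staging buffer, transition to a transfer-destination layout, copy, then transition to a shader-readable layout. It does not call Vulkan at all. `ToyImage`, `Layout`, `transition`, `copy_buffer_to_image` and `upload_via_staging` are hypothetical names invented for this sketch; in a real program the corresponding steps would be recorded with `vkCmdPipelineBarrier` (using a `VkImageMemoryBarrier`) and `vkCmdCopyBufferToImage`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the VK_IMAGE_LAYOUT_* values named above. */
typedef enum {
    LAYOUT_UNDEFINED,                /* VK_IMAGE_LAYOUT_UNDEFINED */
    LAYOUT_TRANSFER_DST_OPTIMAL,     /* VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL */
    LAYOUT_SHADER_READ_ONLY_OPTIMAL  /* VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL */
} Layout;

#define MAX_TEXEL_BYTES 64

/* Toy image: tracks only its current layout and the bytes it holds. */
typedef struct {
    Layout layout;
    unsigned char texels[MAX_TEXEL_BYTES];
    size_t size;
} ToyImage;

/* Models a layout-transition barrier. An old layout of UNDEFINED is always
   accepted (in Vulkan it means the previous contents may be discarded, which
   this model does not simulate). The new layout may never be UNDEFINED. */
bool transition(ToyImage *img, Layout old_layout, Layout new_layout)
{
    if (new_layout == LAYOUT_UNDEFINED)
        return false;
    if (old_layout != LAYOUT_UNDEFINED && old_layout != img->layout)
        return false;
    img->layout = new_layout;
    return true;
}

/* Models the buffer-to-image copy: rejected unless the image is currently
   in the transfer-destination layout. */
bool copy_buffer_to_image(const unsigned char *staging, size_t size,
                          ToyImage *img)
{
    if (img->layout != LAYOUT_TRANSFER_DST_OPTIMAL || size > MAX_TEXEL_BYTES)
        return false;
    memcpy(img->texels, staging, size);
    img->size = size;
    return true;
}

/* The whole upload: stage, transition, copy, transition. */
bool upload_via_staging(const unsigned char *pixels, size_t size,
                        ToyImage *img)
{
    unsigned char staging[MAX_TEXEL_BYTES];
    if (size > sizeof staging)
        return false;
    memcpy(staging, pixels, size); /* host write into the staging buffer */
    return transition(img, LAYOUT_UNDEFINED, LAYOUT_TRANSFER_DST_OPTIMAL)
        && copy_buffer_to_image(staging, size, img)
        && transition(img, LAYOUT_TRANSFER_DST_OPTIMAL,
                      LAYOUT_SHADER_READ_ONLY_OPTIMAL);
}
```

Note that nothing in the model depends on which memory type backs the staging buffer. That mirrors the point of the answer: for an optimally tiled image the copy step stays in the sequence whether or not the staging memory is device-local.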