Why exactly can't a renderbuffer be used with a layered framebuffer? I read that if you want depth values you have to use an additional texture instead. Is that true, or is there some way to use a renderbuffer?
Yes, that's true. The simple answer is that the specification defines it that way. From the OpenGL specification, section "Whole Framebuffer Completeness":
If any framebuffer attachment is layered, all populated attachments must be layered. Additionally, all populated color attachments must be from textures of the same target (three-dimensional, one- or two-dimensional array, cube map, or cube map array textures).
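A renderbuffer holds exactly one 2D image, so it can never be a layered attachment. Since there are no layered renderbuffers, you can't use them for layered rendering. If you need depth, the depth attachment has to be a layered texture as well, for example a 2D array texture with a depth format, attached as a whole rather than as a single layer.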
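To make the consequence of that rule concrete, here is a minimal sketch in plain C that models only the first sentence of the quoted rule (every populated attachment is layered, or none is). The `AttachKind` enum and the `layers_consistent` and `demo` functions are hypothetical names made up for this illustration and are not part of OpenGL; the sketch does not model the "same target" requirement for color attachments and makes no OpenGL calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of what is bound to an attachment point.
   These names are illustrative only and do not exist in OpenGL. */
typedef enum {
    ATTACH_NONE,            /* attachment point is not populated */
    ATTACH_RENDERBUFFER,    /* a renderbuffer: always a single 2D image */
    ATTACH_TEXTURE_SINGLE,  /* one image of a texture (e.g. a single layer) */
    ATTACH_TEXTURE_LAYERED  /* a whole 3D, array, or cube map texture */
} AttachKind;

static bool is_layered(AttachKind k)
{
    return k == ATTACH_TEXTURE_LAYERED;
}

/* Returns true when the populated attachments are either all layered
   or all non-layered; unpopulated attachment points are ignored. */
static bool layers_consistent(const AttachKind *a, size_t n)
{
    bool any_layered = false;
    bool any_flat = false;
    for (size_t i = 0; i < n; ++i) {
        if (a[i] == ATTACH_NONE)
            continue;
        if (is_layered(a[i]))
            any_layered = true;
        else
            any_flat = true;
    }
    return !(any_layered && any_flat);
}

/* The two cases from the answer: color is layered in both;
   depth is a renderbuffer in the first and a layered texture in the second. */
void demo(void)
{
    AttachKind with_renderbuffer[] = { ATTACH_TEXTURE_LAYERED, ATTACH_RENDERBUFFER };
    AttachKind with_depth_texture[] = { ATTACH_TEXTURE_LAYERED, ATTACH_TEXTURE_LAYERED };

    assert(!layers_consistent(with_renderbuffer, 2));
    assert(layers_consistent(with_depth_texture, 2));
}
```

In real OpenGL the passing case corresponds to creating the depth texture as, say, a `GL_TEXTURE_2D_ARRAY` with a depth internal format and attaching it to `GL_DEPTH_ATTACHMENT` with `glFramebufferTexture`, which attaches the whole texture as a layered image. The failing case is what `glCheckFramebufferStatus` reports as `GL_FRAMEBUFFER_INCOMPLETE_LAYER_TARGETS`.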