Cached/uncached memory mmap: impact on Neon

I have a camera connected to a Cortex-A9 OMAP4 board. In the 3.4 kernel, the V4L2 video frames are allocated by videobuf2 and mapped into user space with:

static int vb2_dc_mmap(void *buf_priv, struct vm_area_struct *vma)
{
    struct vb2_dc_buf *buf = buf_priv;

    if (!buf) {
        printk(KERN_ERR "No buffer to map\n");
        return -EINVAL;
    }

    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    return vb2_mmap_pfn_range(vma, buf->dma_addr, buf->size,
                  &vb2_common_vm_ops, &buf->handler);
}

I have also tested:

vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

A complex post-processing algorithm, written in Neon assembly, runs on each frame. User space accesses the frame through the standard V4L2 mmap interface:

mmap(NULL, buf.length, PROT_READ | PROT_WRITE, MAP_SHARED, camera->fd, buf.m.offset);

The optimized algorithm performs as follows, depending on how the frame memory was obtained:

x ms:       user-space malloc allocation of a fake frame (reference)
10*x ms:    kernel allocation with pgprot_noncached
4*x ms:     kernel allocation with pgprot_writecombine
x ms:       kernel allocation with no pgprot call
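A rough sketch of how such per-frame timings can be taken in user space. The names time_frame_ms and invert_frame are hypothetical, and invert_frame is only a trivial stand-in for the real Neon routine.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Signature assumed for the per-frame processing routine. */
typedef void (*frame_fn)(uint8_t *frame, size_t len);

/* Average wall-clock time in milliseconds of `runs` calls to fn on the frame. */
static double time_frame_ms(frame_fn fn, uint8_t *frame, size_t len, int runs)
{
    struct timespec t0, t1;
    double total_ms;

    if (runs <= 0)
        return 0.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < runs; i++)
        fn(frame, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    total_ms = (double)(t1.tv_sec - t0.tv_sec) * 1e3 +
               (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
    return total_ms / runs;
}

/* Stand-in for the Neon code: reads and writes every byte of the frame. */
static void invert_frame(uint8_t *frame, size_t len)
{
    for (size_t i = 0; i < len; i++)
        frame[i] = (uint8_t)~frame[i];
}
```

Calling time_frame_ms once with a malloc'ed buffer and once with the mmap'ed frame gives the kind of ratio listed above.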

The problem is that without any pgprot_* call I get some very strange noise, i.e. runs of a few consecutive black pixels at random positions in the video. The noise disappears under some specific circumstances, when all allocated memory ranges are accessed.

Finally, a plain memcpy on memory mapped with the original pgprot_noncached doesn't seem to show any performance issue, but I can't afford to add a memcpy.

How can I fix this, i.e. get a kernel memory allocation that is free of noise and as fast as a user-space malloc?

The Neon code does a lot of vld1.u8 and vst1.u8 accesses with different increments.

Solution

For reference, the solution was to invalidate and flush the memory region (outer_inv_range and outer_flush_range).
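A minimal sketch of what this can look like, under several assumptions: the helper names and the points at which they are called are hypothetical, the physical address is taken to be the buffer's dma_addr, and the stand-ins in the #else branch exist only so that the fragment builds outside a kernel tree. Both calls act on the outer (L2) cache by physical address range; inner-cache maintenance is not shown.

```c
#ifdef __KERNEL__
#include <linux/types.h>
#include <asm/outercache.h>   /* outer_inv_range(), outer_flush_range() */
#else
/* User-space stand-ins (not kernel code): they only record the last range. */
#include <stddef.h>
typedef unsigned long phys_addr_t;
static phys_addr_t last_inv[2], last_flush[2];
static void outer_inv_range(phys_addr_t start, phys_addr_t end)
{
    last_inv[0] = start;
    last_inv[1] = end;
}
static void outer_flush_range(phys_addr_t start, phys_addr_t end)
{
    last_flush[0] = start;
    last_flush[1] = end;
}
#endif

/*
 * Hypothetical hook, assumed to run once the camera has finished writing a
 * frame and before user space reads it: drop stale outer-cache lines so the
 * CPU fetches the new pixels from RAM.
 */
static void frame_sync_for_cpu(phys_addr_t paddr, size_t size)
{
    outer_inv_range(paddr, paddr + size);
}

/*
 * Hypothetical hook, assumed to run before the frame is handed back to the
 * camera: write back and invalidate anything the CPU left in the outer cache.
 */
static void frame_sync_for_device(phys_addr_t paddr, size_t size)
{
    outer_flush_range(paddr, paddr + size);
}
```

With maintenance of this kind in place, the mapping can stay cached (no pgprot_* call), which is the configuration that matched the malloc timing above.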