I have found that the dot product takes the same number of cycles as vector add and vector mul (just one cycle per ALU per core), but mad does not. So I'm curious how many cycles the mad instruction takes.
I switched from mad to the dot product to improve OpenCL performance, but performance got worse. With mad, the kernel in my project takes 58 ms (averaged over multiple runs on an Arm Mali G77 Bifrost); with the dot product, it takes 68 ms. If you have reached a different conclusion, please share it.
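
For anyone who wants to reproduce the comparison, here is a minimal sketch of the two formulations in plain C rather than OpenCL C. The `vec4` type and both function names are hypothetical stand-ins for `float4`, chained `mad` calls, and `dot`; the actual kernel is not shown in this post. The sketch only shows that the two forms compute the same value, and says nothing about how many cycles either takes on the GPU.

```c
/* Hypothetical 4-component vector, standing in for OpenCL's float4. */
typedef struct { float x, y, z, w; } vec4;

/* Variant A: chained multiply-add, analogous to nested mad() calls.
 * Each step is written as a * b + c, which is what mad() approximates. */
static float accumulate_mad(vec4 a, vec4 b, float acc) {
    acc = a.x * b.x + acc;
    acc = a.y * b.y + acc;
    acc = a.z * b.z + acc;
    acc = a.w * b.w + acc;
    return acc;
}

/* Variant B: compute the dot product first, then add it to the
 * accumulator, analogous to acc + dot(a, b). */
static float accumulate_dot(vec4 a, vec4 b, float acc) {
    float d = a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
    return acc + d;
}
```

In general the two variants can differ in the last bits because the additions happen in a different order, so timing comparisons should also check that the outputs still agree within tolerance.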