Performance of function parameters in shaders

I'm trying to understand how passing parameters is implemented in shader languages.

I've read several articles and documentation, but I still have some doubts. In particular, I'm trying to understand how it differs from a C++ function call, with particular emphasis on performance.

There are slight differences between HLSL, Cg and GLSL, but I guess the underlying implementation is quite similar.

What I've understood so far:

Unless otherwise specified, a function parameter is always passed by value (is this true even for matrices?)
Passing by value in this context doesn't have the same implications as in C++. Recursion isn't supported, so the stack isn't used; most functions are inlined and their arguments are put directly into registers.
Functions are often inlined by default (HLSL), or at least the inline keyword is always respected by the compiler (Cg).

Are the considerations above right?

Now two specific questions:

Passing a matrix as function parameter

inline float4 DoSomething(in Mat4x4 mat, in float3 vec) { ... }

Considering the function above, in C++ that would be nasty, and it would definitely be better to use a reference: const Mat4x4&.

What about shaders? Is this a bad approach? I read that, for example, the inout qualifier could be used to pass a matrix by reference, but it implies that the matrix may be modified by the called function.

Do the number and types of the arguments have any implications? For example, is it better to use functions with a limited set of arguments? Or to avoid passing matrices? Is the inout modifier a valid way to improve performance here? If so, does anyone know how a typical compiler implements this?
Are there any differences between HLSL and GLSL on this? Does anyone have hints on this?

Solution

According to the spec, values are always copied. For in parameters, they are copied at call time, for out parameters at return time, and for inout parameters at both call and return time.

In the language of the spec (GLSL 4.50, section 6.1.1 "Function Calling Conventions"):

All arguments are evaluated at call time, exactly once, in order, from left to right. Evaluation of an in parameter results in a value that is copied to the formal parameter. Evaluation of an out parameter results in an l-value that is used to copy out a value when the function returns. Evaluation of an inout parameter results in both a value and an l-value; the value is copied to the formal parameter at call time and the lvalue is used to copy out a value when the function returns.
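
As a rough illustration of that rule, the copy-in/copy-out behavior can be modeled in C++ (used here only because the question compares against it). The function `Accumulate` and the helpers `AccumulateBody`, `CallAccumulate` and `CheckAccumulate` are made up for this sketch and use plain floats. This shows the semantics only: a real shader compiler will normally inline the call and keep the values in registers, so the generated code need not contain any of these copies.

```cpp
#include <cassert>

// Hypothetical GLSL function being modeled:
//   void Accumulate(in float x, inout float sum, out float last) {
//       sum += x; last = x; x = 0.0;
//   }
// The body only ever touches the callee's own formal parameters.
static void AccumulateBody(float& x, float& sum, float& last) {
    sum += x;
    last = x;
    x = 0.0f;  // changes the formal parameter, never the caller's variable
}

// Models the copies the call site performs under the rule quoted above.
void CallAccumulate(const float& xArg, float& sumArg, float& lastArg) {
    float x = xArg;      // in: value copied at call time
    float sum = sumArg;  // inout: value copied at call time...
    float last = 0.0f;   // out: nothing is copied in (undefined in GLSL until
                         // written; zeroed here to keep the C++ well defined)
    AccumulateBody(x, sum, last);
    sumArg = sum;        // ...inout: copied back at return time
    lastArg = last;      // out: copied back at return time
}

// Small usage check.
void CheckAccumulate() {
    float x = 2.0f, sum = 1.0f, last = -1.0f;
    CallAccumulate(x, sum, last);
    assert(x == 2.0f);     // in argument is untouched
    assert(sum == 3.0f);   // inout argument received the updated value
    assert(last == 2.0f);  // out argument received the written value
}
```

Seen this way, inout is not a cheaper form of in. Semantically it adds a copy back to the caller, and whether either copy survives optimization is up to the compiler.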

An implementation is of course free to optimize anything it wants as long as the result is the same as it would be with the documented behavior. But I don't think you can expect it to work in any specific way.

For example, it wouldn't be safe to pass all inout parameters by reference. Say you had this code:

vec4 Foo(inout mat4 mat1, inout mat4 mat2) {
    mat1 = mat4(0.0);         // zero matrix
    mat2 = mat4(1.0);         // identity matrix (1.0 on the diagonal)
    return mat1 * vec4(1.0);  // zero matrix times (1, 1, 1, 1)
}

mat4 myMat;
vec4 res = Foo(myMat, myMat);

The correct result for this is a vector with all components 0.0. If the arguments were passed by reference, mat1 and mat2 inside Foo() would alias the same matrix. The assignment to mat2 would then also change the value of mat1, and the result would be a vector with all components 1.0, which would be wrong.

This is of course a very artificial example, but the optimization has to be selective to work correctly in all cases.
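
The aliasing problem can also be reproduced outside a shader. The following is a minimal C++ sketch that models both strategies for the Foo() example. The `Mat4`/`Vec4` aliases and the helpers `Diagonal`, `Mul`, `FooBody`, `FooCopyInOut`, `FooByReference` and `CheckAliasing` are all made up for this illustration and are not part of any shader API. It says nothing about what a particular driver actually generates.

```cpp
#include <array>
#include <cassert>

using Mat4 = std::array<float, 16>;  // column-major 4x4, stand-in for mat4
using Vec4 = std::array<float, 4>;   // stand-in for vec4

// Models GLSL's mat4(d): d on the diagonal, zero elsewhere.
Mat4 Diagonal(float d) {
    Mat4 m{};
    for (int i = 0; i < 4; ++i) m[i * 4 + i] = d;
    return m;
}

// Matrix times column vector, column-major storage.
Vec4 Mul(const Mat4& m, const Vec4& v) {
    Vec4 r{};
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            r[row] += m[col * 4 + row] * v[col];
    return r;
}

// Body of Foo, operating on whatever storage it is handed.
Vec4 FooBody(Mat4& mat1, Mat4& mat2) {
    mat1 = Diagonal(0.0f);
    mat2 = Diagonal(1.0f);
    return Mul(mat1, Vec4{1.0f, 1.0f, 1.0f, 1.0f});
}

// Spec-conforming model: every inout argument gets its own copy.
Vec4 FooCopyInOut(Mat4& arg1, Mat4& arg2) {
    Mat4 mat1 = arg1;  // copy in
    Mat4 mat2 = arg2;  // copy in
    Vec4 res = FooBody(mat1, mat2);
    arg1 = mat1;       // copy out; this sketch simply picks an order
    arg2 = mat2;       // for the two copy-backs
    return res;
}

// Naive "optimization": hand the caller's storage straight to the body.
Vec4 FooByReference(Mat4& arg1, Mat4& arg2) {
    return FooBody(arg1, arg2);
}

// Small usage check: the same variable passed for both parameters.
void CheckAliasing() {
    Mat4 a = Diagonal(2.0f);
    assert(FooCopyInOut(a, a) == (Vec4{0.0f, 0.0f, 0.0f, 0.0f}));

    Mat4 b = Diagonal(2.0f);
    assert(FooByReference(b, b) == (Vec4{1.0f, 1.0f, 1.0f, 1.0f}));
}
```

The two versions only disagree when both arguments name the same variable. A compiler can therefore skip the copies whenever it can prove the arguments do not alias, which is easy once the call has been inlined.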