function · machine-learning · statistics · sampling

Sampling from a high-dimensional function


I have a function f that takes N real-valued inputs and is very expensive to compute. Each of the N inputs, call one of them n, can take values in a range (n_min, n_max). I want to study the properties of f by evaluating it on various inputs and observing the outputs it produces. (The goal is to use ML to build an approximator for f.)

Let's say that due to time constraints I can only generate 1000 samples. Is it "better" to select the input vectors that I feed to f by

(A) stepping each n from n_min to n_max with a step size large enough to stay within the sample budget (a regular grid), or

(B) sampling each n uniformly at random over the range (n_min, n_max)?

Choice (A) has the desirable property of varying only one input at a time while keeping the others fixed, but choice (B) is likely to explore more of the input space.


Solution

  • B is better when the function is not equally sensitive to all of its inputs, which it probably is not. In the extreme case, imagine you have 1000 samples and 3 inputs, but only one of them actually affects the function. If you sample over a 10×10×10 regular grid, as in A, you end up with only 10 distinct values of the relevant input. If you sample uniformly at random, all 1000 samples will be informative (see the numeric sketch below).

    As a variation on B, consider using a quasi-random sequence of inputs, such as a Sobol sequence. The advantage over uniform random sampling is that your coverage of the input space won't have clusters or holes (see the Sobol sketch below).
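
To make the grid-versus-uniform argument concrete, here is a minimal sketch using NumPy. The bounds n_min = 0 and n_max = 1 (shared by all three inputs) are hypothetical placeholders, and N = 3 comes from the extreme example above. It counts how many distinct values of a single input each scheme produces:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_min, n_max = 0.0, 1.0  # hypothetical bounds, same for all 3 inputs

# (A) a regular 10x10x10 grid: 1000 points in total, but each input
# dimension only ever takes 10 distinct values
axis = np.linspace(n_min, n_max, 10)
grid = np.stack(np.meshgrid(axis, axis, axis), axis=-1).reshape(-1, 3)

# (B) 1000 points drawn uniformly at random: every point contributes
# a fresh value in each dimension (ties have probability ~zero)
uniform = rng.uniform(n_min, n_max, size=(1000, 3))

# If only the first input matters, scheme A yields 10 informative
# samples while scheme B yields 1000.
print(np.unique(grid[:, 0]).size)     # 10
print(np.unique(uniform[:, 0]).size)  # 1000
```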
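
And here is a minimal sketch of the quasi-random variant, assuming SciPy >= 1.7, whose scipy.stats.qmc module provides a Sobol sampler. The per-input bounds are again hypothetical placeholders:

```python
import numpy as np
from scipy.stats import qmc

# Sobol sequences are best drawn in powers of two to keep their
# balance properties; 2**10 = 1024 is close to the 1000-sample budget
sampler = qmc.Sobol(d=3, scramble=True, seed=0)
unit_points = sampler.random_base2(m=10)  # shape (1024, 3), in [0, 1)

# Scale from the unit cube to each input's (n_min, n_max) range;
# these bounds are hypothetical placeholders
l_bounds = [0.0, -5.0, 10.0]
u_bounds = [1.0, 5.0, 20.0]
samples = qmc.scale(unit_points, l_bounds, u_bounds)

print(samples.shape)  # (1024, 3)
```

Scrambling keeps the low-discrepancy coverage while randomizing the points, so you can still treat repeated runs as independent replications.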