I have a function $f$ that takes $N$ real-valued inputs and is very expensive to compute. Each of the $N$ inputs, call one $n$, has a range of values $(n_{\min}, n_{\max})$ that it can take on. I am interested in studying the properties of $f$ by generating samples at various inputs and seeing what outputs it produces. (The goal is to use ML to build an approximator for $f$.)

Let's say that due to time constraints I can only generate 1000 samples. Is it "better" to select the set of $N$-dimensional inputs that I feed to $f$ by

(A) iterating from $n_{\min}$ to $n_{\max}$ with a big enough step size for each $n$, or

(B) uniformly sampling each $n$ on the range $(n_{\min}, n_{\max})$?
Choice (A) has the desirable property of keeping every other input fixed while varying only one value at a time, but choice (B) has the desirable property of likely exploring more parts of the input space.
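For concreteness, here is a minimal sketch of the two designs in Python, assuming a hypothetical $N = 3$ problem whose per-input ranges are collected in a `bounds` array (the names and ranges are illustrative, not part of the actual problem):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
bounds = np.array([[0.0, 1.0], [-5.0, 5.0], [10.0, 20.0]])  # hypothetical (n_min, n_max) rows
N = len(bounds)
budget = 1000

# (A) Regular grid: budget**(1/N) = 10 steps per input gives 10**3 = 1000
# points, but only 10 distinct values along each axis.
steps = round(budget ** (1 / N))
axes = [np.linspace(lo, hi, steps) for lo, hi in bounds]
grid_samples = np.array(list(itertools.product(*axes)))  # shape (1000, 3)

# (B) Uniform sampling: 1000 independent draws over the full box.
uniform_samples = rng.uniform(bounds[:, 0], bounds[:, 1], size=(budget, N))
```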
(B) is better whenever the function is not equally sensitive to all of its inputs, which it almost certainly is not. In the extreme case, imagine you have 1000 samples and 3 inputs, but only one of them actually affects the function. If you sample over a $10 \times 10 \times 10$ regular grid, as in (A), you end up with only 10 distinct values of the relevant input, so effectively only 10 informative samples. If you sample uniformly at random, all 1000 samples are informative.
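To see this in numbers: reusing the hypothetical `grid_samples` and `uniform_samples` arrays from the sketch in the question, and assuming the first input is the only one that matters, you can count the distinct values of that input each design actually evaluates:

```python
# Suppose f(x) depends on x[0] alone. How many distinct values of x[0]
# does each design probe?
print(len(np.unique(grid_samples[:, 0])))     # 10   -> effectively 10 informative samples
print(len(np.unique(uniform_samples[:, 0])))  # 1000 -> every sample is informative
```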
As a variation on (B), consider using a quasi-random (low-discrepancy) sequence of inputs, such as a Sobol sequence. The advantage over independent uniform sampling is that your input-space coverage won't have clusters or holes.
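Here is a minimal sketch of that variation, assuming SciPy's `scipy.stats.qmc` module (available since SciPy 1.7) and the same hypothetical `bounds` as above:

```python
import numpy as np
from scipy.stats import qmc

bounds = np.array([[0.0, 1.0], [-5.0, 5.0], [10.0, 20.0]])  # hypothetical (n_min, n_max) rows

# Sobol points live in the unit cube [0, 1]^N and are then scaled to the
# real ranges. Powers of two preserve the sequence's balance properties,
# so 2**10 = 1024 is a natural budget near 1000.
sampler = qmc.Sobol(d=len(bounds), scramble=True, seed=0)
unit_points = sampler.random_base2(m=10)  # (1024, 3) points in [0, 1]^3
sobol_samples = qmc.scale(unit_points, bounds[:, 0], bounds[:, 1])
```

Scrambling randomizes the sequence without destroying its low-discrepancy structure, so you can generate replicate designs just by changing the seed.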