I am using C++ AMP to evaluate mathematical expressions of the form (+ x y), given in Polish notation. The tricky part is that the expressions come as trees, which I "compile" into a linear list of instructions, relying on the property that a breadth-first traversal of the tree yields a list which, when iterated backwards, guarantees that each child node has been evaluated before its parent.
struct amp_instruction
{
    op_code opcode;     // add, sub, variable, etc.
    int index;          // index of first child
    double value;       // for constants
    double weight;      // for variables
    std::string label;  // node label
    std::shared_ptr<concurrency::array_view<double, 1>> data; // amp data
};
When creating instructions, I assign the data field like this:
instr.data = make_shared<array_view<double, 1>>(n);
Then, my evaluation is:
array_view<double, 1> amp_interpreter::evaluate(vector<amp_instruction>& instructions)
{
    // Iterate backwards so every child is evaluated before its parent.
    for (auto it = rbegin(instructions); it != rend(instructions); ++it)
    {
        switch (it->opcode)
        {
        case ADD:
        {
            array_view<double, 1> a = *instructions[it->index].data;
            array_view<double, 1> b = *instructions[it->index + 1].data;
            parallel_for_each(a.extent, [=](index<1> i) restrict(amp)
            {
                a[i] += b[i];
            });
            // Reuse the left child's buffer as this node's result.
            it->data = instructions[it->index].data;
            break;
        }
        // other cases...
        case VARIABLE:
        {
            array_view<double, 1> a = *it->data;
            array_view<const double, 1> v = *gpu_data[it->label];
            double weight = it->weight;
            parallel_for_each(a.extent, [=](index<1> i) restrict(amp)
            {
                a[i] = v[i] * weight;
            });
            break;
        }
        default: break;
        }
    }
    return *instructions[0].data;
}
where gpu_data is a map holding the initial values for my variables (which can number up to a million per variable, for example). So the idea is: for each variable, grab its values (cached in gpu_data), apply a weight, and store the result in the data field of the corresponding amp_instruction. The data is then passed from child to parent in order to reduce memory allocations on the GPU.
Now, this code works fine when I compile my program in debug mode: memory usage stays constant at around ~1 GB for 1000 tree expressions with 1M values per tree variable. It also produces correct values, so the logic works. But in release mode, memory usage spikes to 10-20 GB. This only happens with the default accelerator, which is my Radeon R9 Fury; the basic render driver accelerator does not have this issue.
My hardware is an i7 4790K, 32 GB DDR3, and a Radeon R9 Fury. Could this be a driver issue, or am I perhaps not using C++ AMP as intended? I hope someone can shed some light on this, because this bug renders the whole approach unusable.
Thanks.
Update: I was not able to pinpoint the exact source of the memory leak, but it definitely comes from the runtime. Changing the "Runtime Library" setting in the project options from "Multi-threaded DLL (/MD)" to "Multi-threaded Debug DLL (/MDd)" eliminates the leak.
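For anyone building from the command line rather than the IDE, that project setting corresponds (to the best of my knowledge) to these MSVC CRT linkage switches:

```
/MD    Multi-threaded DLL (release CRT)        -> leak appears
/MDd   Multi-threaded Debug DLL (debug CRT)    -> leak disappears
```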