I read this article (link) and am trying to understand the algorithm presented there.
I now understand almost all of the points in the article, but I have one question:
How do you convert video volumes of different scales (obtained after dense sampling) into descriptors?
As I understand it: if I have a video of 100 frames at 120×160 and I apply dense sampling at several scales (for example 5×5×5, 10×10×10, and 20×20×20), then I get 15360, 1920, and 240 non-overlapping cubes respectively. After that I need to build a descriptor for each cube, and all descriptors must have the same length (in this article the descriptor length equals the cube size, so 125, 1000, and 8000).
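For reference, here is a quick sanity check of those numbers (my own sketch, assuming non-overlapping grid-aligned sampling; the variable names are illustrative, not from the article):

```python
# Sanity check of the cube counts and descriptor lengths quoted above.
# Assumes non-overlapping (grid-aligned) dense sampling.
frames, height, width = 100, 120, 160
scales = [5, 10, 20]

for s in scales:
    num_cubes = (frames // s) * (height // s) * (width // s)
    descriptor_len = s ** 3  # flattened raw cube values
    print(f"scale {s}: {num_cubes} cubes, descriptor length {descriptor_len}")
# scale 5:  15360 cubes, descriptor length 125
# scale 10: 1920 cubes,  descriptor length 1000
# scale 20: 240 cubes,   descriptor length 8000
```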
One solution I can think of is to build a cube at each scale around every pixel and then concatenate them into one vector of length 125 + 1000 + 8000 = 9125. Is that right?
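A minimal sketch of that idea (my own illustration, not code from the article): extract a cube at each scale centered on one pixel, flatten each, and concatenate them:

```python
import numpy as np

# Illustrative sketch of the per-pixel concatenation idea (not from the article).
video = np.random.rand(100, 120, 160)  # placeholder video volume (T, H, W)
scales = [5, 10, 20]

def multiscale_descriptor(video, t, y, x, scales):
    """Concatenate flattened cubes of every scale centered at (t, y, x)."""
    parts = []
    for s in scales:
        h = s // 2
        cube = video[t - h:t - h + s, y - h:y - h + s, x - h:x - h + s]
        parts.append(cube.ravel())
    return np.concatenate(parts)

d = multiscale_descriptor(video, t=50, y=60, x=80, scales=scales)
print(d.shape)  # (9125,) == 125 + 1000 + 8000
```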
So, I've found the answer.
Around each pixel I must build a cube at each scale, so there will be around 1,920,000 cubes per scale (the video has 100 × 120 × 160 = 1,920,000 pixels; cubes centered near the volume border don't fully fit, hence "around").
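To make the "around" precise, here is a sketch under my own assumption that border pixels whose cube would fall outside the volume are simply skipped:

```python
# Count valid cube centers per scale; centers whose cube would extend
# past the volume boundary are skipped (my assumption on border handling).
frames, height, width = 100, 120, 160
total_pixels = frames * height * width
print(total_pixels)  # 1920000

for s in [5, 10, 20]:
    valid = (frames - s + 1) * (height - s + 1) * (width - s + 1)
    print(f"scale {s}: {valid} valid centers out of {total_pixels}")
# scale 5:  1737216 valid centers
# scale 10: 1525251 valid centers
# scale 20: 1153521 valid centers
```

Note that storing all ~1.9 million 9125-dimensional descriptors densely would take roughly 1.92e6 × 9125 × 4 bytes ≈ 70 GB in float32, so in practice you would likely compute them on the fly or sample a subset of pixels.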