algorithm parallel-processing fpga pipeline

Pipelined algorithm definition

I read that FPGA is suited for algorithms which are parallel or which can be pipelined. What is by definition an algorithm that can be pipelined?

Solution

It means you can split a task T into multiple steps T₁, T₂,...,T_n and each of these steps is more or less independent. Now the data is first injected into a processor P₁ that performs task T₁ on it, after a time step the results of P₁ are transferred to P₂ where task T₂ is executed. The point is that at that time, processor P₁ is again available, so you load the next chunk of data that needs to be processed into processor P₁. You can compare it to an assembly line where each worker (processor) does his/her part in a large process. A process that can be pipelined is efficient because the time it takes to process n chunks of data scales with n but still requires the same amount of hardware as if you would process only one chunk of data (evidently there will be some overhead in order to organize this).

Note that with processor, I do not mean a physical processor (like an 80x86), I simply mean a device that can do a certain job. Whether it requires an instruction set, memory, clock cycles, etc. is irrelevant.

Not all algorithms can be pipelined because sometimes there are dependencies between data which makes it hard/impossible to let the data be processed in chunks: you need all data available at once, or you cannot process it (or at least not efficiently).

As @Paebbels says (see comment below) Such processors or processing elements (PEs) or processing units (PUs) can be implemented in FPGAs. It's possible to map a network of PEs onto the FPGA area especially when many bit operation or non power of 2 data types are required. FPGAs perform mostly bad if floating point operations or fast DRAM access is required. Then GPUs or standard CPUs might be faster. Note: FPGAs are mounted onto PCIe cards so even a x100 faster algorithm can be slower compared to CPU algorithms, because latency or PCIe transfer rates will eat all benefits.

The point is nevertheless to achieve speedup without a substantial increase of hardware cost.

Analogy

In my course text of digital electronics, they made an analogy with a launderette. Say you want to do your laundry. Now evidently you cannot put all these clothes in the washing machine and dryer at once: you need to split it into ten parts.

Now say you have a machine that acts as both a washing machine and a dryer. And it requires two time steps to do the washing and drying. Then it would require twenty time steps to do your laundry, and you use a single machine.

A solution is to hire ten washing machines and ten dryers. Put all the clothes in the washing machines, then when it is done put all the clothes in the dryers and you are done in two steps. The downside is that you need to hire ten washing machines and dryers.

A solution using pipelining is that you hire one washing machine, and one dryer. Now you put the first batch of clothes in the washing machine. When done you put the washed clothes in the dryer, but in the meantime you put a next batch of clothes into the washing machine. Thus the washing machine and the dryer (processors) work in parallel, but at a different chuck of clothes (data). At each time step, you remove the clothes from the dryer, put clothes from the washing machine in the dryer, and put a new batch of clothes into the washing machine. As a result you will have eleven time steps, but only have to hire one washing machine and one dryer. When it comes to costs and time, pipelining can be efficient.