I have recently been trying to write a web spider, so I looked at some web spider projects written in PHP.
In those projects, I found that the "PCNTL" extension is used frequently, but I can't find any detailed tutorials or manuals about it.
So I want to know whether the "PCNTL" extension is really suitable for a web spider. If not, what are the alternatives?
"PCNTL" is an extension that provides C-like process control functions, most notably fork (exposed in PHP as pcntl_fork).
I am not sure whether there are good tutorials, but you can look at C / C++ examples to understand how to use those PHP functions.
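To give an idea of the fork pattern those C examples teach, here is a minimal sketch in Python, whose os.fork and os.waitpid wrap the same system calls as PHP's pcntl_fork and pcntl_waitpid. The names crawl_in_children and handle_url are made up for illustration, and the sketch assumes a Unix-like system and one child per URL, which is not necessarily how any real spider does it.

```python
import os

def crawl_in_children(urls, handle_url):
    """Fork one child process per URL and return {url: child exit code}."""
    children = {}
    for url in urls:
        pid = os.fork()
        if pid == 0:
            # Child process: do the work, then exit without returning
            # into the parent's loop.
            code = 1
            try:
                handle_url(url)
                code = 0
            finally:
                os._exit(code)
        # Parent process: remember which child handles which URL.
        children[pid] = url

    results = {}
    for pid, url in children.items():
        # Wait for each child so that no zombie processes are left behind.
        _, status = os.waitpid(pid, 0)
        results[url] = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return results

if __name__ == "__main__":
    def show(url):
        # Hypothetical handler; a real one would download and parse the page.
        print("child", os.getpid(), "handling", url, flush=True)

    print(crawl_in_children(["http://example.com/a", "http://example.com/b"], show))
```

The parent only keeps track of process IDs and exit codes; anything a child needs to hand back (pages, links) has to go through files, a database, or pipes, because the child works on its own copy of memory.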
Several years ago we built a web crawler. Instead of fork, we used a shell script that started 100 instances of the crawler in parallel.
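The same "start many instances" idea can be sketched without fork. This is not our original shell script, just an illustration in Python using subprocess; launch_instances is a made-up name, and passing the instance number as the last argument is an assumption about how each instance would pick its share of the work.

```python
import subprocess
import sys

def launch_instances(command, count):
    """Start `count` copies of `command` at once and return their exit codes.

    Each copy gets its instance number appended as the last argument.
    """
    procs = [subprocess.Popen(command + [str(i)]) for i in range(count)]
    # All instances are already running; now wait for each one to finish.
    return [proc.wait() for proc in procs]

if __name__ == "__main__":
    # Stand-in worker so the sketch runs anywhere; a real setup might pass
    # something like ["php", "crawler.php"] (hypothetical script name).
    worker = [sys.executable, "-c", "import sys; print('instance', sys.argv[1])"]
    print(launch_instances(worker, 4))
```

Each instance is a completely separate process, so a crash in one does not affect the others, at the cost of more memory per instance.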
Another alternative is curl-multi, but once again there is not much information or many tutorials about it. We tried it and did not find it very reliable, but I believe you should check it out.
Another alternative is to do it in Python: there are several libraries that offer a lot of possibilities.
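As one example of what the Python route can look like, here is a minimal sketch that downloads several pages concurrently using only the standard library (concurrent.futures and urllib.request). The names fetch and fetch_all, the 10 second timeout, and the pool size of 8 are illustrative choices, not recommendations.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Download one page and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()

def fetch_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch URLs concurrently; return {url: body, or the exception raised}."""
    def safe(url):
        # Keep one failing URL from aborting the whole batch.
        try:
            return fetch_one(url)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(safe, urls)))
```

Calling fetch_all with a list of URLs returns a dictionary in which each value is either the page body or the exception that occurred for that URL, so the caller can decide what to retry.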