Tags: php, multithreading, asynchronous, hhvm, hacklang

"Walk" a PHP array using multithreading (async) with Hack (HHVM)


Not sure why there isn't yet a "hack" tag (sorry to list in PHP), but...

I am wondering whether, and how, it would be possible to walk an array on multiple threads using the multithreaded/async feature of Hack. I don't really need this, but I'm curious and it might be useful.

I've looked at the documentation for "Hack"'s async feature

http://docs.hhvm.com/manual/en/hack.async.php

and it's a bit difficult to follow.

Here is the basic idea of what I would like to make (or see done):

a) Split the array into x sections and process them on x "threads", or b) create x threads, each of which processes the next available item, i.e. when a thread finishes an item, it asks the parent thread for a new one to process. Hack doesn't do "threads", but the same is represented by an async function.

Basically, the end goal is to quickly optimize a standard foreach block to run on multiple threads with minimal code changes, and also to see what Hack can do and how it works.

I've come up with some code as a sample, but I think I've totally got the idea wrong.

class ArrayWalkAsync
{
    protected $array;
    protected $threads = Array();
    protected $current_index = 0;
    protected $max_index;
    // number of "threads" (kept separate from the $threads list above)
    protected $thread_count = 4;

    public async function array_walk($array)
    {
        $this->array = $array;
        $this->max_index = count($array) - 1;
        $result = Array();
        for ($i=0;$i<$this->thread_count;$i++)
        {
            $this->threads[] = new ArrayWalkThread();
        }
        $continue = true;
        while($continue)
        {
            // hand one item to each "thread" and collect the awaitables
            $awaitables = Array();
            for ($i=0;$i<$this->thread_count;$i++)
            {
                $a = $this->processNextItem($i);
                if ($a !== false)
                {
                    $awaitables[] = $a;
                } else {
                    $continue = false;
                }
            }
            // wait for each
            foreach ($awaitables as $awaitable_i)
            {
                $item = await $awaitable_i;
                // do something with the result
                $result[$item->index] = $item->result;
            }
        }
        return $result;
    }

    // returns an awaitable for the next item, or false when none are left
    protected function processNextItem($thread_id)
    {
        if ($this->current_index > $this->max_index)
        {
            return false;
        }
        $a = new ArrayWalkItem();
        $a->value = $this->array[$this->current_index];
        $a->index = $this->current_index;
        $this->current_index++;
        return $this->threads[$thread_id]->process($a,$this);
    }

    public function processArrayItem($item)
    {
        $value = $item->value;
        sleep(1);
        $item->result = 1;
    }

}


class ArrayWalkThread
{
     async function process($item,$parent): Awaitable<?ArrayWalkItem>
     {
        $parent->processArrayItem($item);
        return $item;
     }

}

class ArrayWalkItem
{
    public $index;
    public $value;
    public $result;
}

Solution

  • Hack's async functions aren't going to do what you want. In Hack, async functions are not threads. They are a mechanism to hide IO latency and data fetching, not to do more than one computation at once. (This is the same as in C#, from which the Hack feature derives.)

    This blog post on async functions has a good explanation:

    For several months now, Hack has had a feature available called async which enables writing code that cooperatively multitasks. This is somewhat similar to threading, in that multiple code paths are executed in parallel, however it avoids the lock contention issues common to multithreaded code by only actually executing one section at any given moment.

    “What’s the use of that?”, I hear you ask. You’re still bound to one CPU, so it should take the same amount of time to execute your code, right? Well, that’s technically true, but script code execution isn’t the only thing causing latency in your application. The biggest piece of it probably comes from waiting for backend databases to respond to queries.

    [...]

    While [an http] call is busy sitting on its hands waiting for a response, there’s no reason you shouldn’t be able to do other things, maybe even fire off more requests. The same goes for database queries, which can take just as long, or even filesystem access which is faster than network, but can still introduce lag times of several milliseconds, and those all add up!

    Sorry for the confusion on this point -- you're not the only one to try to erroneously use async this way. The current docs do a terrible job of explaining this. We're doing a revamp of the docs; the current draft does a somewhat better job, but I'm going to go file a task to make sure it's crystal clear before we launch the new docs.
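
    To make the distinction concrete, here is a minimal sketch of the kind of work async is meant for. It is only an illustration, not a definitive implementation: fetch_item and walk_async are hypothetical names, \HH\Asio\usleep stands in for a real IO wait such as a database query, and the sketch assumes an HHVM version that provides \HH\Asio\v and permits top-level statements (newer releases use different helpers and entry points).

```hack
<?hh

// Hypothetical stand-in for an IO-bound call (database query, HTTP request).
// \HH\Asio\usleep suspends this function while it waits, so other
// awaitables can make progress; a plain sleep(1) would block instead.
async function fetch_item(int $index): Awaitable<int> {
  await \HH\Asio\usleep(1000000); // simulate one second of IO latency
  return $index * 2;
}

async function walk_async(array<int> $items): Awaitable<Vector<int>> {
  $awaitables = array();
  foreach ($items as $item) {
    // Start every operation first; nothing is awaited yet.
    $awaitables[] = fetch_item($item);
  }
  // Await them all together so that their waiting periods overlap.
  return await \HH\Asio\v($awaitables);
}

$start = microtime(true);
// join() blocks until the awaitable finishes and returns its result.
$results = \HH\Asio\join(walk_async(array(1, 2, 3, 4)));
var_dump($results);
printf("elapsed: %.1fs\n", microtime(true) - $start);
```

    Because the four waits overlap, this should finish in roughly one second rather than four. If fetch_item did CPU-bound work or called the blocking sleep(1), as in the question's processArrayItem, the items would run one after another, since only one section of code executes at any given moment.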