
How to return big string from spawned process in Ruby?


From my Ruby script I am spawning a PhantomJS process. That process will return a JSON string to Ruby, and Ruby will parse it.

Everything works fine, but when the PhantomJS script returns a huge JSON (more than 10000 entries), Ruby doesn't seem to handle it: it doesn't receive the whole string, the output gets cut off.

An entry in the JSON looks like this (but some entries can have about 5 more attributes):

{"id" : 3, "nSteps" : 5, "class" : "class-name", "width" : 300, "height" : 500, "source" : "this is a big string", "nMove" : 10, "id-name" : "this is a big string", "checked" : true, "visible" : false}

This is the code I have right now:

@pid = Process.spawn("phantomjs",
                     "myparser.js",
                     :out => pipe_cmd_out, :err => pipe_cmd_out)
Timeout.timeout(400) do
  Process.wait(@pid)
  pipe_cmd_out.close
  output = pipe_cmd_in.read
  return JSON.parse(output)
end

Is there any way I can read the JSON by chunks or somehow increase the buffer limit of the pipe?

EDIT:

In order to send the data from PhantomJS to Ruby, I have the following at the very end of my PhantomJS script:

console.log(JSON.stringify(data));
phantom.exit();

If I launch the PhantomJS script from the terminal I get the JSON correctly. However, when I run it from within Ruby, the response gets cut off.

The length of the string being passed to console.log when it breaks is 132648.

EDIT:

I think I found the exact problem. When I specify an :out when spawning the process, and the returned JSON is big (132648 characters), Ruby never reads all of it. So when doing:

reader, writer = IO.pipe
pid = Process.spawn("phantomjs",
                    "script.js",
                    :out => writer
                    )
Timeout.timeout(100) do
  Process.wait(pid)
  writer.close
  output = reader.read
  json_output = JSON.parse(output)
end

It won't work.

But if I let PhantomJS just write to its standard output, it outputs the JSON correctly. So, doing:

reader, writer = IO.pipe
pid = Process.spawn("phantomjs",
                    "script.js"
                    )
Timeout.timeout(100) do
  Process.wait(pid)
  writer.close
  output = reader.read
  json_output = JSON.parse(output)
end

will print the results to the terminal correctly. So I believe the problem is that, for big JSON, either the write through the pipe doesn't complete or the Ruby reader doesn't know how to read it.


Solution

  • I suspect the problem isn't Ruby, but either how or when you're reading.

    It could be that PhantomJS hasn't finished sending the output before you read, leading to a partial response. Try routing the output to a file to determine its size in bytes. That will tell you whether PhantomJS is completing its task and closing the JSON correctly, and will let you know how many bytes you can expect to see in the buffer.
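
    For instance, here is a minimal sketch of that check (the file names are placeholders). Process.spawn accepts a file path for :out and :err, so the child's output can be captured on disk and measured:

    pid = Process.spawn('phantomjs', 'myparser.js',
                        :out => ['phantom_out.json', 'w'],
                        :err => ['phantom_err.log',  'w'])
    Process.wait(pid)
    puts File.size('phantom_out.json')  # bytes PhantomJS actually produced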


    ...what about the pipe that connects Ruby and the spawned process? Where can I find out what limits the buffer has?

    Digging around in the dark corners of my memory to root out how this'd work...

    This should be reasonably accurate from what I remember: a pipe between two processes is backed by a fixed-size kernel buffer, typically 64 KB on Linux. The OS accepts writes until that buffer is full, then it blocks the sending side. Once the buffer has room again, because the script has read from it, the sender is allowed to resume writing. So, even if there are multiple GB of data pending, it won't all be sent at once unless the script is reading continuously from the buffer to keep it clear. Notice that your 132648-byte payload is comfortably past that 64 KB limit, which would explain why only the big payloads get cut off.

    When the script does a read, read doesn't grab only what's in the buffer; it wants to see an EOF, and it won't see that until every write end of the pipe has been closed, which normally happens when the sender exits. Under the hood, Ruby's I/O subsystem drains the pipe in chunks as the sender produces them and accumulates the result in your variable. In other words, your script should pause and wait until all the data is transferred, then continue, since read is a blocking call. A self-contained sketch of both behaviours follows.
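
    This sketch uses nothing but the standard library (the 64 KB figure is typical of Linux; other systems differ):

    reader, writer = IO.pipe

    t = Thread.new do
      writer.write('x' * 200_000)  # blocks once the ~64 KB pipe buffer is full
      writer.close                 # closing the write end is what produces EOF
    end

    sleep 0.1
    puts t.alive?        # => true: the writer is stalled until someone reads
    data = reader.read   # read drains the pipe and returns once it sees EOF
    t.join
    puts data.bytesize   # => 200000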

    There are different ways of handling I/O. You're slurping the data, which is what we call it when we read everything in one step. That's not scalable, because if the incoming data is larger than the memory you can spare for one string, you've got a problem. Possibly you should be doing incremental reads, storing the data in a temporary file, then streaming that into a JSON parser such as YAJL; a sketch of that is below. But the first step is to determine whether you're actually getting the complete JSON string.
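
    A minimal sketch of that incremental approach, using only the standard library. spool_to_tempfile is a name I'm inventing here, and a streaming parser such as yajl-ruby could replace the final JSON.parse if even the parsed payload is too big to hold at once:

    require 'json'
    require 'tempfile'

    # Read the pipe in chunks and spool to disk instead of slurping,
    # so the child never stalls on a full pipe buffer.
    def spool_to_tempfile(io, chunk_size = 16 * 1024)
      spool = Tempfile.new('phantom-json')
      spool.write(io.readpartial(chunk_size)) until io.eof?
      spool.rewind
      spool
    end

    # Usage, given the reader end of the pipe:
    #   spool = spool_to_tempfile(reader)
    #   data  = JSON.parse(spool.read)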

    An alternate way of dealing with the problem is to request smaller sets of data, then reassemble them in your script. Just as requesting every record from a database via SQL is a bad idea because it isn't scalable and beats up the DBM, maybe you should request your JSON data in pages or blocks and only process the immediately necessary results; see the sketch after this paragraph.
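
    A hypothetical sketch of that paging idea. It assumes myparser.js could be taught to accept a page number and page size as arguments, which it currently does not:

    require 'json'
    require 'open3'

    results = []
    page = 0
    loop do
      # capture2 collects the child's whole STDOUT, so no pipe juggling here
      out, status = Open3.capture2('phantomjs', 'myparser.js', page.to_s, '100')
      raise 'phantomjs failed' unless status.success?
      chunk = JSON.parse(out)
      break if chunk.empty?     # an empty page means we've read everything
      results.concat(chunk)
      page += 1
    end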


    This might be it...

    @pid = Process.spawn("phantomjs",
                         "myparser.js",
                         :out => pipe_cmd_out, :err => pipe_cmd_out)
    Timeout.timeout(400) do
      Process.wait(@pid)
      pipe_cmd_out.close
      output = pipe_cmd_in.read
      return JSON.parse(output)
    end
    

    The STDOUT and STDERR of PhantomJS are being assigned to pipe_cmd_out, but you close that stream with pipe_cmd_out.close, then try to read pipe_cmd_in, which isn't even defined in the snippet. That all seems wrong. I think you should close pipe_cmd_in, then read pipe_cmd_out:

    require 'json'
    require 'timeout'

    child_in,  pipe_cmd_in  = IO.pipe  # PhantomJS reads its STDIN from child_in
    pipe_cmd_out, child_out = IO.pipe  # PhantomJS writes STDOUT/STDERR to child_out

    @pid = Process.spawn(
      "phantomjs",
      "myparser.js",
      :in  => child_in,
      :out => child_out,
      :err => child_out
    )
    child_in.close   # the parent keeps only its own ends of the pipes;
    child_out.close  # without this close, pipe_cmd_out.read never sees EOF

    Timeout.timeout(400) do
      pipe_cmd_in.close            # signal EOF on PhantomJS's STDIN
      output = pipe_cmd_out.read   # read before waiting so a full pipe can't deadlock
      Process.wait(@pid)
      return JSON.parse(output)
    end
    

    Be careful trying to parse the STDERR output, though. It's likely not to be JSON, and will cause an exception when the parser throws up.

    We close the input to the command-line application because a lot of command-line tools that read from STDIN will hang until their STDIN is closed. That's what pipe_cmd_in.close does: it closes the write side of PhantomJS's STDIN pipe, delivering an EOF that signals it should begin processing. Then, when PhantomJS writes to its STDOUT, your script sees that output via the stream available in pipe_cmd_out.

    And, rather than doubling up STDOUT and STDERR into one stream, I'd probably keep them separate:

    require 'json'
    require 'timeout'

    child_in,  pipe_cmd_in  = IO.pipe  # child's STDIN
    pipe_cmd_out, child_out = IO.pipe  # child's STDOUT
    pipe_cmd_err, child_err = IO.pipe  # child's STDERR, kept separate

    @pid = Process.spawn(
      "phantomjs",
      "myparser.js",
      :in  => child_in,
      :out => child_out,
      :err => child_err
    )
    child_in.close
    child_out.close
    child_err.close

    Timeout.timeout(400) do
      pipe_cmd_in.close
      output = pipe_cmd_out.read   # again, read before waiting on the child
      errors = pipe_cmd_err.read
      Process.wait(@pid)
      if output.empty?
        errors                     # return the error text, not the IO object
      else
        JSON.parse(output)
      end
    end
    

    The code that calls the above would need to sense whether the return value is a String, an Array, or a Hash. If it's the first, an error occurred; if it's one of the latter two, the call succeeded and you can iterate over the result.
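
    For example (fetch_phantom_data is a hypothetical method wrapping the spawn/timeout code above):

    result = fetch_phantom_data  # hypothetical wrapper around the block above

    case result
    when String
      warn "PhantomJS reported an error: #{result}"
    when Array, Hash
      result.each { |entry| puts entry.inspect }  # success: iterate the JSON
    end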