Tags: php, perl, url-encoding, piping, proc-open

php - Piping input to perl process automatically decodes url-encoded string


I'm using proc_open to pipe some text to a Perl script for faster processing. The text includes URL-encoded strings as well as literal spaces. When a URL-encoded space (such as %20) appears in the raw text, it seems to have been decoded into a literal space by the time it reaches the Perl script. The Perl script relies on the positions of the literal spaces, so these unwanted spaces corrupt my output.

Why is this happening, and is there a way to prevent it from happening?

Relevant code snippet:

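$descriptorspec = array(
    0 => array("pipe", "r"),  // child's STDIN: PHP writes, Perl reads
    1 => array("pipe", "w"),  // child's STDOUT: Perl writes, PHP reads
);
$cmd = "perl script.pl";
$process = proc_open($cmd, $descriptorspec, $pipes);
$output = "";

if (is_resource($process)) {
    // Send the raw text, then close STDIN so the Perl script sees end-of-file.
    fwrite($pipes[0], $raw_string);
    fclose($pipes[0]);
    // Read everything the script printed; fgets returns false at end of stream.
    while (($line = fgets($pipes[1])) !== false) {
        $output .= $line;
    }
    fclose($pipes[1]);
    proc_close($process);
}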

and a line of raw text input looks something like this:

key url\tvalue1\tvalue2\tvalue3

I might be able to avoid the issue by changing the format of my input, but for various reasons that is undesirable, and it circumvents rather than solves the key issue.

Furthermore, I know that the issue is occurring somewhere between the PHP script and the Perl script, because I have examined the raw text (with an echo) immediately before writing it to the Perl script's STDIN pipe, and I have tested my Perl script directly on URL-encoded raw strings.
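One further check that isolates the pipe itself is a round trip through a child process that simply echoes its input. The sketch below is only an illustration: it assumes a Unix-like system with `cat` available, uses `cat` as a stand-in for the Perl script, and the sample line is made up. If the comparison is true, the pipe delivered the bytes unchanged.

```php
<?php
// Round-trip check: push a URL-encoded line through a pipe and compare bytes.
// Assumption: `cat` exists on this system and stands in for the Perl script.
$descriptorspec = array(
    0 => array("pipe", "r"),  // child's STDIN
    1 => array("pipe", "w"),  // child's STDOUT
);

// Hypothetical input line: key, literal space, URL-encoded URL, then tab-separated values.
$sample = "key http%3A%2F%2Fexample.com%2Fa%20b\t1\t0\t2\n";

$process = proc_open("cat", $descriptorspec, $pipes);
if (!is_resource($process)) {
    exit("proc_open failed\n");
}

fwrite($pipes[0], $sample);
fclose($pipes[0]);                          // signal end-of-file to the child
$echoed = stream_get_contents($pipes[1]);   // read everything the child wrote
fclose($pipes[1]);
proc_close($process);

// True means the %20 survived the pipe untouched.
var_dump($echoed === $sample);
```

Swapping `cat` for `perl -pe 1` (which prints each input line unchanged) would run the same check through the Perl interpreter.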

I've now added the perl script below. It basically boils down to a mini map-reduce job.

use strict;
use warnings;

# First pass: sum the value columns for each distinct key.
my %rows;
while (<STDIN>) {
    chomp;
    my @line = split(/\t/);
    my $key = $line[0];
    if (exists $rows{$key}) {
        # Key already seen: add this line's values column by column.
        for my $i (1..$#line) {
            $rows{$key}->[$i-1] += $line[$i];
        }
    } else {
        # First occurrence: store a copy of the value columns.
        $rows{$key} = [ @line[1..$#line] ];
    }
}

# Second pass: drop the last space-separated token of each key, then count,
# per column, how many of the original keys had a positive sum.
my %newrows;
for my $key (keys %rows) {
    my @temparray = split(/ /, $key);
    pop(@temparray);
    my $newkey = join(" ", @temparray);
    if (exists $newrows{$newkey}) {
        for my $i (0..$#{ $rows{$key} }) {
            $newrows{$newkey}->[$i] += $rows{$key}->[$i] > 0 ? 1 : 0;
        }
    } else {
        $newrows{$newkey} = [ map { $_ > 0 ? 1 : 0 } @{ $rows{$key} } ];
    }
}

for my $key (keys %newrows) {
    print "$key\t", join("\t", @{ $newrows{$key} }), "\n";
}

Solution

  • Note to self: always check your assumptions. It turns out that somewhere in my hundreds of millions of lines of input there were, in fact, literal spaces where there should have been url-encoded spaces. It took a while to find them, since there were hundreds of millions of correct literal spaces, but there they were.

    Sorry guys!
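For anyone hunting the same kind of bad data, here is a rough sketch of a scan for stray literal spaces. It assumes every key field should contain exactly one literal space, as in the sample line in the question; `find_suspect_lines` and the sample data are hypothetical names and values made up for illustration.

```php
<?php
// Hypothetical helper: returns the 1-based numbers of lines whose key field
// (the text before the first tab) does not contain the expected number of
// literal spaces. Assumption: a well-formed key has exactly one.
function find_suspect_lines(iterable $lines, int $expectedSpaces = 1): array
{
    $suspects = array();
    $lineNo = 0;
    foreach ($lines as $line) {
        $lineNo++;
        $fields = explode("\t", rtrim($line, "\r\n"));
        if (substr_count($fields[0], " ") !== $expectedSpaces) {
            $suspects[] = $lineNo;
        }
    }
    return $suspects;
}

// Made-up sample data: the second line has a literal space where %20 belongs.
$lines = array(
    "k http%3A%2F%2Fexample.com%2Fa%20b\t1\t0\n",
    "k http%3A%2F%2Fexample.com%2Fa b\t1\t0\n",
);
print_r(find_suspect_lines($lines));
```

With the sample data above, only line 2 is reported. For very large inputs, the function accepts any iterable, so a generator that yields lines from `fgets` can be passed instead of an in-memory array.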