Tags: php, performance, file, large-files, iconv

PHP using iconv to convert large files


I need to convert text files between character encodings without hogging the server's memory. The input file is user-supplied, so its size is not limited.

Would it be more efficient to wrap the Unix iconv command with exec() (which I'd rather avoid, although I already use it elsewhere in the application for other file operations), or should I read the file line by line and write the output to another file?
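
For reference, this is roughly what I mean by the exec() wrapper (a sketch only; it assumes the iconv binary is available on the server and uses the same $charset array as the loop below):

$cmd = sprintf(
    "iconv -f %s -t %s %s -o %s",
    escapeshellarg($charset["in"]),
    escapeshellarg($charset["out"]),
    escapeshellarg("in.txt"),
    escapeshellarg("out.txt")
);
// The iconv binary streams through the whole file by itself; PHP just waits for it.
exec($cmd, $output, $status);
if ($status !== 0) {
    // A non-zero exit code means iconv failed (e.g. an invalid byte sequence).
    throw new RuntimeException("iconv failed with status $status");
}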

The line-by-line alternative I'm thinking of would work like this:

$in = fopen("in.txt", "r");
$out = fopen("out.txt", "w+");
while(($line = fgets($in, 4096)) !== false) {
    $converted = iconv($charset["in"], $charset["out"], $line);
    fwrite($out, $converted);
}
rename("out.txt", "in.txt");

Is there a better approach to convert the file quickly and efficiently? I suspect this might be rather CPU intensive, but then iconv itself is an expensive operation, so I'm not sure how much I can actually reduce the load on the server.
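
Would PHP's built-in convert.iconv.* stream filters be a better fit here? A rough sketch of what I mean (the filter name format and the hard-coded encodings are placeholders; I haven't benchmarked this):

$in = fopen("in.txt", "r");
$out = fopen("out.txt", "w+");
// The conversion happens in the stream layer while copying, so PHP only
// buffers small chunks rather than the whole file.
stream_filter_append($out, "convert.iconv.UTF-8.EUC-JP", STREAM_FILTER_WRITE);
stream_copy_to_stream($in, $out);
fclose($in);
fclose($out);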

Thanks!


Solution

  • Alright, thanks for the input. I did "my homework" based on it and got the following results, working with a 50 MB sample of actual CSV data:

    First, iterating over the file using PHP:

    $in = fopen("a.txt", "r");
    $out = fopen("p.txt", "w+");
    
    $start = microtime(true);
    
    while(($line = fgets($in)) !== false) {
        $converted = iconv("UTF-8", "EUC-JP//TRANSLIT", $line);
        fwrite($out, $converted);
    }
    
    $elapsed = microtime(true) - $start;
    echo "<br>Iconv took $elapsed seconds\r\n";
    


    Iconv took 2.2817220687866 seconds

    That's not so bad, I guess, so I tried the exact same approach in bash, so that it wouldn't have to load the whole file but would iterate over each line instead (which might not be exactly what happens, as I understand from Lajos Veres' answer). Indeed, this method wasn't exactly efficient; the CPU was under heavy load the whole time. Also, the output file is smaller than the other two, although after a quick look it appears to be the same, so I must have made a mistake in the bash script (possibly the unquoted echo $line, which collapses whitespace); however, that shouldn't have such an effect on performance anyway:

    #!/bin/bash
    echo "" > b.txt
    time echo $(
        while read line
        do
            echo $line |iconv -f utf-8 -t EUC-JP//TRANSLIT >> b.txt
        done < a.txt
    )
    

    real    9m40.535s
    user    2m2.191s
    sys     3m18.993s

    And then the classic approach, which I would have expected to hog memory; however, checking the CPU/memory usage, it didn't seem to take any more memory than the other approaches, making it the winner:

    #!/bin/bash
    time echo $(
        iconv -f utf-8 -t EUC-JP//TRANSLIT a.txt -o b2.txt
    )
    

    real    0m0.256s
    user    0m0.195s
    sys     0m0.060s

    I'll try to get a bigger sample file to test the two more efficient methods and make sure the memory usage doesn't become significant; however, the result seems clear enough to assume that a single pass over the whole file with the iconv binary is the most efficient (I didn't try the equivalent in PHP, as I believe loading an entire file into an array/string in PHP is never a good idea).
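
    For that follow-up test, a quick way to confirm the PHP loop's memory stays flat would be to record the peak allocation after the run; a minimal sketch using the same files as the benchmark above:

    $in = fopen("a.txt", "r");
    $out = fopen("p.txt", "w+");
    
    while(($line = fgets($in)) !== false) {
        fwrite($out, iconv("UTF-8", "EUC-JP//TRANSLIT", $line));
    }
    
    fclose($in);
    fclose($out);
    
    // Peak memory actually allocated to the PHP process during the run;
    // it should stay at a few MB regardless of the input file's size.
    echo "Peak memory: " . memory_get_peak_usage(true) . " bytes\r\n";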