
How to read two big files and compare contents


What I'm trying to do is read two files: a big one (5.6 GB, approximately 600 million lines) and a smaller one (16 MB, 2 million lines).

I want to find the lines that appear in both files.

$wordlist = array_unique(array_filter(file('small.txt', FILE_IGNORE_NEW_LINES)));
$duplicate = array();
if($file = fopen('big.txt', 'r')){
    while(!feof($file)){
        $lines = rtrim(fgets($file));
        if(in_array($lines, $wordlist)){
            echo $lines." : exists.\n";
        }
    }
    fclose($file);
}

But this takes forever to finish (it has been running for 6 hours and still hasn't finished :/).

My question is: is there a faster way to search huge files?


Solution

  • You won't need to call array_filter() or array_unique() if you are going to call array_flip() -- it will eliminate the duplicates for you because you can't have duplicate keys in the same level of an array.

    Furthermore:

    1. array_unique() is stated to be slower than array_flip() (and there are times when it is slower than two array_flip()s)
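    2. array_filter() with its default behavior (no callback) removes every falsy value, including the string "0", so I will caution you not to rely on it here.
    3. array_flip() sets up the very speedy isset() check, which is a hash lookup by key instead of the linear scan that in_array() performs for every line. isset() will likely outperform array_key_exists() as well; it differs only in returning false for keys whose value is null, which array_flip() never produces.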
    4. I am adding the FILE_SKIP_EMPTY_LINES flag to the file() call so that your lookup array is potentially smaller.
    5. Calling rtrim() on every line of your big file may be causing some drag too. Do you know if you have consistently identical newline characters in both files? It would spare you six hundred million calls of rtrim() if you can safely remove the FILE_IGNORE_NEW_LINES flag from the file() call. Alternatively, if you know the newlines (e.g. \n? or \r\n?) that trail the big.txt lines, you can append the specific newline(s) to the $lookup keys -- this means preparing the smaller file's data once instead of trimming every line of the big file.
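
    To make points 1 to 3 concrete, here is a minimal sketch on a tiny in-memory array. The sample words are made up for illustration and stand in for the lines of small.txt; treat it as a demonstration of the behavior, not as a benchmark.

    ```php
    <?php
    // Hypothetical sample data standing in for the lines of small.txt.
    $words = ['apple', '0', 'banana', 'apple', '', 'cherry', 'banana'];

    // array_filter() with no callback drops every falsy value, so '0' is lost
    // together with the empty string.
    $filtered = array_values(array_filter($words));

    // Dropping only the empty string keeps '0' as a valid word.
    $nonEmpty = array_filter($words, function ($w) {
        return $w !== '';
    });

    // array_flip() turns values into keys, so duplicates collapse on their own.
    $lookup = array_flip($nonEmpty);

    // Membership is now a keyed hash lookup instead of an in_array() scan.
    $hasBanana = isset($lookup['banana']);
    $hasZero   = isset($lookup['0']);
    $hasGrape  = isset($lookup['grape']);
    ```

    Here $filtered no longer contains '0', while $lookup holds four distinct keys (apple, 0, banana, cherry) and answers each membership question with a single isset().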

    Untested Code:

    $lookup = array_flip(file('small.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
    if ($file = fopen('big.txt', 'r')) {
        // fgets() returns false at end of file, so no separate feof() check is needed
        while (($line = fgets($file)) !== false) {
            $line = rtrim($line);
            if (isset($lookup[$line])) {
                echo "$line : exists.\n";
            }
        }
        fclose($file);
    }
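
    The alternative from point 5 (appending the newline to the lookup keys so the big loop needs no rtrim()) could look roughly like the sketch below. It is also untested against files of this size. findCommonLines() and its $eol parameter are hypothetical names introduced here, and the sketch assumes every line of the big file ends with exactly $eol, except possibly the last one.

    ```php
    <?php
    // Sketch only: findCommonLines() is a hypothetical helper, and $eol is an
    // assumption about the line ending used throughout the big file.
    function findCommonLines(string $smallPath, string $bigPath, string $eol = "\n"): Generator
    {
        $words = file($smallPath, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        if ($words === false) {
            throw new RuntimeException("Cannot read $smallPath");
        }

        // Append the newline once per small-file line; duplicate words simply
        // overwrite the same key.
        $lookup = [];
        foreach ($words as $word) {
            $lookup[$word . $eol] = true;
        }

        $fh = fopen($bigPath, 'r');
        if ($fh === false) {
            throw new RuntimeException("Cannot open $bigPath");
        }

        $eolLength = strlen($eol);
        $last = '';
        while (($line = fgets($fh)) !== false) {
            // The raw line, newline included, is compared directly against the keys.
            if (isset($lookup[$line])) {
                yield substr($line, 0, -$eolLength);
            }
            $last = $line;
        }
        fclose($fh);

        // The final line may lack a trailing newline, so check it separately.
        if ($last !== '' && substr($last, -$eolLength) !== $eol && isset($lookup[$last . $eol])) {
            yield $last;
        }
    }
    ```

    It would be used as foreach (findCommonLines('small.txt', 'big.txt') as $line) { echo "$line : exists.\n"; }. Because it is a generator, matches are produced one at a time rather than collected in memory. If big.txt mixes \n and \r\n endings, the rtrim() version above is the safer choice.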