Tags: php, laravel, csv, fgetcsv

Reading > 1GB GZipped CSV files from external FTP server


In a scheduled task of my Laravel application I'm reading several large gzipped CSV files, ranging from 80 MB to 4 GB, from an external FTP server. They contain products that I store in my database based on a product attribute.

I loop through a list of product feeds that I want to import, but each time a fatal error is returned: 'Allowed memory size of 536870912 bytes exhausted'. I can bump up the length parameter of the fgetcsv function from 1000 to 100000, which solves the problem for the smaller files (< 500 MB), but for the larger files it still returns the fatal error.

Is there a solution that allows me to either download or unzip the .csv.gz files, read the lines (in batches or one by one), and insert the products into my database without running out of memory?

$feeds = [
    "feed_baby-mother-child.csv.gz",
    "feed_computer-games.csv.gz",
    "feed_general-books.csv.gz",
    "feed_toys.csv.gz",
];

foreach ($feeds as $feed) {
    $importedProducts = array();
    $importedFeedProducts = 0;

    $csvfile = 'compress.zlib://ftp://' . config('app.ftp_username') . ':' . config('app.ftp_password') . '@' . config('app.ftp_host') . '/' . $feed;

    if (($handle = fopen($csvfile, "r")) !== FALSE) {
        $row = 1;
        $header = fgetcsv($handle, 1, "|");
                
        while (($data = fgetcsv($handle, 1000, "|")) !== FALSE) {
            if ($row == 1 || $data === array(null)) { $row++; continue; } // skip the header row and blank lines
                    
            $product = array_combine($header, $data);
            $importedProducts[] = $product;
        }

        fclose($handle);
    } else {
        echo 'Failed to open: ' . $feed . PHP_EOL;
        continue;
    }
    
    // start inserting products into the database below here
}

Solution

  • The problem is probably not the gzip file itself. Of course you can download it and process it afterwards, but that will leave you with the same issue.

    Because you are loading all products into a single array (in memory):

    $importedProducts[] = $product;
    

    You could comment this line out and see if that prevents you from hitting your memory limit.

    Usually I would create a method like addProduct($product) that handles this in a memory-safe way.

    From there you can decide on a maximum number of products to buffer before doing a bulk insert, to achieve optimal speed. I usually use something between 1000 and 5000 rows.

    For example

    class ProductBatchInserter
    {
        private $maxRecords = 1000;
        private $records = [];

        public function addProduct(array $record)
        {
            $this->records[] = $record;

            // Bulk insert and reset the buffer once it is full
            if (count($this->records) >= $this->maxRecords) {
                $this->flush();
            }
        }

        // Insert whatever is still buffered (call this once after the loop)
        public function flush()
        {
            if (!empty($this->records)) {
                EloquentModel::insert($this->records);
                $this->records = [];
            }
        }
    }
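
    Used inside the CSV loop from the question, it could look roughly like this (a sketch; the flush() call at the end inserts the last, partial batch):

    $inserter = new ProductBatchInserter();

    while (($data = fgetcsv($handle, 1000, "|")) !== FALSE) {
        $product = array_combine($header, $data);
        $inserter->addProduct($product); // buffered, bulk-inserted per 1000 rows
    }

    $inserter->flush(); // don't forget the remaining rows of the last batch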
    

    However, I usually don't implement it as a standalone class; in my projects I integrate it as a BulkInsertable trait that can be used on any Eloquent model, roughly as sketched below.
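
    A minimal sketch of what such a trait might look like (assuming the model's own static insert() is used; the names queueForInsert() and flushInserts() are illustrative, not from the original answer):

    trait BulkInsertable
    {
        private static $insertBuffer = [];
        private static $insertBatchSize = 1000;

        // Buffer a row and bulk insert once the batch size is reached
        public static function queueForInsert(array $record)
        {
            static::$insertBuffer[] = $record;

            if (count(static::$insertBuffer) >= static::$insertBatchSize) {
                static::flushInserts();
            }
        }

        // Insert whatever is left in the buffer
        public static function flushInserts()
        {
            if (!empty(static::$insertBuffer)) {
                static::insert(static::$insertBuffer);
                static::$insertBuffer = [];
            }
        }
    }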

    But this should give you a direction for how you can avoid the memory limit.

    Or, the easier but significantly slower option: just insert the row at the point where you now append it to the array. But that will put a ridiculous load on your database and will be very slow.
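
    For completeness, that would look something like this inside the read loop (a sketch; Product is assumed to be your Eloquent model):

    // One INSERT statement per CSV row: memory safe, but far slower than batching
    $product = array_combine($header, $data);
    Product::insert($product);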

    If the GZIP stream is the bottleneck

    I expect this is not the issue, but if it were, you could use gzopen():

    https://www.php.net/manual/en/function.gzopen.php

    and pass the gzopen handle as the handle for fgetcsv.

    But I expect the compress.zlib:// stream wrapper you are using already does this for you.

    If not, I mean like this:

    $input = gzopen('input.csv.gz', 'r');

    while (($row = fgetcsv($input)) !== false) {
        // do something memory safe, like suggested above
    }
    

    If you need to download the file anyway, there are many ways to do it, but make sure you use something memory-safe, like fopen()/fgets() or a Guzzle stream, and don't use something like file_get_contents(), which loads the whole file into memory.
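
    For example, a streamed download that never holds the whole file in memory could look like this (a sketch; the FTP credentials, paths and the HTTP URL are placeholders, and Guzzle's 'sink' option only applies if the feed is also reachable over HTTP):

    // Copy the remote file to local storage chunk by chunk
    $remote = fopen('ftp://user:pass@ftp.example.com/feed_toys.csv.gz', 'r');
    $local  = fopen(storage_path('app/feed_toys.csv.gz'), 'w');
    stream_copy_to_stream($remote, $local);
    fclose($remote);
    fclose($local);

    // Or with Guzzle: 'sink' streams the response body straight to disk
    $client = new \GuzzleHttp\Client();
    $client->get('https://example.com/feed_toys.csv.gz', [
        'sink' => storage_path('app/feed_toys.csv.gz'),
    ]);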