LWP::UserAgent Issue with Memory Usage During Large File PUT


I've been trying to transfer a large file using LWP (or a web service API that depends on LWP), and no matter how I approach it, the process crumbles at a certain point. On a whim, I watched top while my script ran and noticed that memory usage balloons to over 40GB right before things start failing.

I thought the issue was the S3 APIs I used initially, so I decided to use LWP::UserAgent to connect to the server myself. Unfortunately, the issue remains with just LWP: memory usage still balloons, and although the script runs longer before failing, it got about halfway through the transfer and then hit a segmentation fault.

Simply reading the file I want to transfer in segments works just fine and never takes memory usage above 1.4GB:

my $filename = "/backup/2022-12-13/accounts/backup.tar.gz";
my $size = -s $filename; 
my $chunkSize = (1024*1024*100);
my $parts = ceil($size / $chunkSize);

# open 9.6 GB file
open(my $file, '<', $filename) or die("Error reading file, stopped");
binmode($file); 

for (my $i = 0; $i <= $parts; $i++) {
    my $chunk;
    my $offset = $i * $chunkSize + 1;

    read($file, $chunk, $chunkSize, $offset);

    # Code to do what I need to do with the chunk goes here.
    sleep(5);

    print STDOUT "Uploaded $i of $parts.\n";
}

However, adding in the LWP code suddenly raises the memory usage significantly and, as I said, eventually ends in a segmentation fault (at 55% of the transfer). Here's a minimal, complete, reproducible example:

use POSIX;
use Amazon::S3;
use LWP::UserAgent;
use HTTP::Request::Common;
use Net::Amazon::Signature::V4;
my $awsSignature = Net::Amazon::Signature::V4->new( $config{'access_key_id'}, $config{'access_key'}, 'us-east-1', 's3' );

# Get Upload ID from Amazon.
our $simpleS3 = Amazon::S3->new({
    aws_access_key_id  => $config{'access_key_id'},
    aws_secret_access_key => $config{'access_key'},
    retry => 1
}); 
my $bucket = $simpleS3->bucket($bucketName); 
my $uploadId = $bucket->initiate_multipart_upload('somebigobject');

my $filename = "/backup/2022-12-13/accounts/backup.tar.gz";
my $size = -s $filename; 
my $chunkSize = (1024*1024*100);
my $parts = ceil($size / $chunkSize);

# open 9.6 GB file
open(my $file, '<', $filename) or die("Error reading file, stopped");
binmode($file); 

for (my $i = 0; $i <= $parts; $i++) {
    my $chunk;
    my $offset = $i * $chunkSize + 1;

    read($file, $chunk, $chunkSize, $offset);

    # Code to do what I need to do with the chunk goes here.
    my $request = HTTP::Request::Common::PUT("https://bucket.s3.us-east-1.amazonaws.com/somebigobject?partNumber=" . ($i + 1) . "&uploadId=" . $uploadId);
    $request->header('Content-Length' => length($chunk));
    $request->content($chunk);
    my $signed_request = $awsSignature->sign( $request );
    
    my $ua = LWP::UserAgent->new();
    my $response = $ua->request($signed_request);
    
    my $etag = $response->header('Etag');
    
    # Try to make sure nothing lingers after this loop ends.
    $signed_request = '';
    $request = '';
    $response = '';
    $ua = '';           
        
    ($partList{$i + 1}) = $etag =~ m#^"(.*?)"$#;

    print STDOUT "Uploaded $i of $parts.\n";
}

The same issue occurs, only sooner in the process, if I use Paws::S3, Net::Amazon::S3::Client or Amazon::S3. It appears each chunk somehow stays in memory. As the code progresses I can see a gradual but significant increase in memory usage until it hits that wall at around 40GB. Here's the bit that replaces sleep(5) in the real-world code:

        $partList{$i + 1} = $bucket->upload_part_of_multipart_upload('some-big-object', $uploadId, $i + 1, $chunk);

The final code that fails because it uses so much memory:

use POSIX;
use Amazon::S3;
our $simpleS3 = Amazon::S3->new({
    aws_access_key_id  => $config{'access_key_id'},
    aws_secret_access_key => $config{'access_key'},
    retry => 1
}); 
my $bucket = $simpleS3->bucket($bucketName);

my $filename = "/backup/2022-12-13/accounts/backup.tar.gz";
my $size = -s $filename; 
my $chunkSize = (1024*1024*100);
my $parts = ceil($size / $chunkSize);
my %partList;

my $uploadId = $bucket->initiate_multipart_upload('some-big-object');

# open 9.6 GB file
open(my $file, '<', $filename) or die("Error reading file, stopped");
binmode($file); 

for (my $i = 0; $i <= $parts; $i++) {
    my $chunk;
    my $offset = $i * $chunkSize + 1;

    read($file, $chunk, $chunkSize, $offset);

    # Code to do what I need to do with the chunk goes here.
    $partList{$i + 1} = $bucket->upload_part_of_multipart_upload('some-big-object', $uploadId, $i + 1, $chunk);

    print STDOUT "Uploaded $i of $parts.\n";
}

Solution

  • The problem wasn't actually LWP or the S3 API, but a silly error in how I was reading the file. I was using read($file, $chunk, $chunkSize, $offset);

    The fourth argument to read is an offset into the buffer ($chunk), not a position in the file. I thought it was moving that far into the file; instead, it padded $chunk with $offset filler bytes before appending the data that was read. Each chunk was therefore larger than the last, until the process finally crashed. Instead, the code needs to be:

    seek ($file, $offset, 0);
    read ($file, $chunk, $chunkSize);
    

    This produces the expected chunk size. The 0 passed as the third argument to seek means the offset is counted from the start of the file.
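
To make the difference concrete, here is a minimal sketch that compares the two read patterns on a small temporary file. It is an illustration under assumptions rather than the original upload script: the 2500-byte temp file, the 1000-byte chunk size, and the helper names chunk_lengths, read_with_buffer_offset and read_with_seek are all hypothetical, and no S3 or LWP calls are involved.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);
use File::Temp qw(tempfile);

# Hypothetical stand-in for the backup archive: a 2500-byte temp file.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} 'x' x 2500;
close($tmp) or die "Cannot close $path: $!";

my $chunk_size = 1000;    # assumed value, scaled down from the 100 MB in the question

# Return the length of every chunk produced by the given reader callback.
sub chunk_lengths {
    my ($path, $chunk_size, $reader) = @_;
    my $size  = -s $path;
    my $parts = int(($size + $chunk_size - 1) / $chunk_size);    # ceiling division

    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);

    my @lengths;
    for my $i (0 .. $parts - 1) {
        my $chunk = $reader->($fh, $i * $chunk_size, $chunk_size);
        push @lengths, length($chunk);
    }
    close($fh);
    return @lengths;
}

# Original pattern: the 4th argument of read() is an offset into the BUFFER,
# so $chunk is padded with "\0" bytes up to $offset before the data is appended.
sub read_with_buffer_offset {
    my ($fh, $offset, $chunk_size) = @_;
    my $chunk;
    defined(read($fh, $chunk, $chunk_size, $offset)) or die "Read failed: $!";
    return $chunk;
}

# Fixed pattern: seek() positions the file handle, read() fills a fresh buffer.
sub read_with_seek {
    my ($fh, $offset, $chunk_size) = @_;
    seek($fh, $offset, SEEK_SET) or die "Seek failed: $!";
    my $chunk;
    defined(read($fh, $chunk, $chunk_size)) or die "Read failed: $!";
    return $chunk;
}

my @padded = chunk_lengths($path, $chunk_size, \&read_with_buffer_offset);
my @sought = chunk_lengths($path, $chunk_size, \&read_with_seek);

print "buffer offset: @padded\n";    # prints "buffer offset: 1000 2000 2500"
print "seek + read:   @sought\n";    # prints "seek + read:   1000 1000 500"
```

With the buffer-offset form, each chunk grows by the accumulated offset, which matches the steadily climbing memory usage described in the question. Note that this sketch starts offsets at 0 and loops exactly $parts times. The question's loop used $i * $chunkSize + 1 and $i <= $parts, which, combined with seek, would appear to skip the first byte of the file and attempt one extra, empty part, so those two details are probably worth adjusting as well.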