Tags: django, azure, docker, amazon-s3, gunicorn

Issue uploading files to S3 from Django in Docker


We have migrated our application from Heroku to Azure App Services running in Docker. When the application runs inside the Docker container, every attempt to upload a file to S3 times out and causes the gunicorn worker to exit. After many hours of debugging and searching the web, I am stuck. Does anyone have any idea why this might be happening only in Docker?

packages being used:

Django==3.2.25
boto3==1.35.9
botocore==1.35.9
s3transfer==0.10.2
django-storages==1.14.4

Settings:

DEFAULT_FILE_STORAGE = 'project.utils.storage.HashedFilenameMediaS3Boto3Storage'

Storage class (this setup was implemented about 8 years ago, so maybe I need to make some changes?):

# Imports and the NoAvailableName exception were not included in the original
# snippet; they are the pieces this code presumably relies on from the rest of
# project/utils/storage.py.
import hashlib
import os
from errno import EEXIST

from django.core.files import File
from django.core.files.storage import FileSystemStorage
from django.utils.encoding import force_text  # deprecated alias in Django 3.2
from storages.backends.s3boto3 import S3Boto3Storage


class NoAvailableName(Exception):
    """Raised by get_available_name so _save() knows the name is already taken."""


def HashedFilenameMetaStorage(storage_class):
    class HashedFilenameStorage(storage_class):
        def __init__(self, *args, **kwargs):
            # Try to tell storage_class not to uniquify filenames.
            # This class will be the one that uniquifies.
            try:
                new_kwargs = dict(kwargs) #, uniquify_names=False)
                super(HashedFilenameStorage, self).__init__(*args,
                                                            **new_kwargs)
            except TypeError:
                super(HashedFilenameStorage, self).__init__(*args, **kwargs)

        def get_available_name(self, name):
            raise NoAvailableName()

        def _get_content_name(self, name, content, chunk_size=None):
            dir_name, file_name = os.path.split(name)
            file_ext = os.path.splitext(file_name)[1]
            file_root = self._compute_hash(content=content,
                                           chunk_size=chunk_size)
            # file_ext includes the dot.
            out = os.path.join(dir_name, file_root + file_ext)
            return out

        def _compute_hash(self, content, chunk_size=None):
            if chunk_size is None:
                chunk_size = getattr(content, 'DEFAULT_CHUNK_SIZE',
                                     File.DEFAULT_CHUNK_SIZE)

            hasher = hashlib.sha1()

            cursor = content.tell()
            content.seek(0)
            try:
                while True:
                    data = content.read(chunk_size)
                    if not data:
                        break
                    hasher.update(data)
                return hasher.hexdigest()
            finally:
                content.seek(cursor)

        def save(self, name, content, **kwargs):
            # Get the proper name for the file, as it will actually be saved.
            if name is None:
                name = content.name

            name = self._get_content_name(name, content)
            name = self._save(name, content)

            # Store filenames with forward slashes, even on Windows
            return force_text(name.replace('\\', '/'))

        def _save(self, name, content, *args, **kwargs):
            new_name = self._get_content_name(name=name, content=content)
            try:
                return super(HashedFilenameStorage, self)._save(new_name,
                                                                content,
                                                                *args,
                                                                **kwargs)
            except NoAvailableName:
                # File already exists, so we can safely do nothing
                # because their contents match.
                pass
            except OSError as e:
                if e.errno == EEXIST:
                    # We have a safe storage layer and file exists.
                    pass
                else:
                    raise
            return new_name

    HashedFilenameStorage.__name__ = 'HashedFilename' + storage_class.__name__
    return HashedFilenameStorage


HashedFilenameFileSystemStorage = HashedFilenameMetaStorage(
    storage_class=FileSystemStorage,
)


# HashedFilenameMetaStorage returns a class that wraps another file storage backend
# this way storages can be used, but all files will use a hashed filename
HashedFilenameMediaS3Boto3Storage = lambda: HashedFilenameMetaStorage(
    storage_class=S3Boto3Storage,
)(location=os.environ.get('AWS_MEDIA_PREFIX'))
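
For reference, here is a minimal sketch of how the wrapped storage is meant to behave when run inside a configured Django project (the file path and payload are made up for illustration): the saved name is derived from the SHA-1 of the contents, and re-saving identical content reuses the existing object instead of writing a duplicate.

from django.core.files.base import ContentFile

storage = HashedFilenameFileSystemStorage()

# The saved name is <dir>/<sha1-of-contents><ext>, e.g. 'exports/8e4c...f92.xml'
name = storage.save('exports/report.xml', ContentFile(b'<items/>'))

# Saving the same bytes again hits the NoAvailableName/EEXIST branch in _save()
# and returns the same hashed name rather than creating a second file.
same_name = storage.save('exports/report-copy.xml', ContentFile(b'<items/>'))
assert name == same_name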

I enabled the boto debug output using:

import boto3
boto3.set_stream_logger('')
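
As an aside, the root stream logger is very noisy; the same call can be scoped to a single logger if you only care about the transfer machinery, for example:

import logging
import boto3

# Only show s3transfer's debug output instead of every botocore event
boto3.set_stream_logger('s3transfer', logging.DEBUG)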

When attempting to upload a file to S3, this is the boto debug output:

2024-08-30T02:12:29.998638943Z [export_project_list_items] save file to s3
2024-08-30T02:12:30.001444653Z 2024-08-29 19:12:30,000 botocore.hooks [DEBUG] Changing event name from creating-client-class.iot-data to creating-client-class.iot-data-plane
2024-08-30T02:12:30.005178965Z 2024-08-29 19:12:30,004 botocore.hooks [DEBUG] Changing event name from before-call.apigateway to before-call.api-gateway
2024-08-30T02:12:30.006846971Z 2024-08-29 19:12:30,006 botocore.hooks [DEBUG] Changing event name from request-created.machinelearning.Predict to request-created.machine-learning.Predict
2024-08-30T02:12:30.009351479Z 2024-08-29 19:12:30,009 botocore.hooks [DEBUG] Changing event name from before-parameter-build.autoscaling.CreateLaunchConfiguration to before-parameter-build.auto-scaling.CreateLaunchConfiguration
2024-08-30T02:12:30.009625280Z 2024-08-29 19:12:30,009 botocore.hooks [DEBUG] Changing event name from before-parameter-build.route53 to before-parameter-build.route-53
2024-08-30T02:12:30.011194186Z 2024-08-29 19:12:30,010 botocore.hooks [DEBUG] Changing event name from request-created.cloudsearchdomain.Search to request-created.cloudsearch-domain.Search
2024-08-30T02:12:30.012315189Z 2024-08-29 19:12:30,011 botocore.hooks [DEBUG] Changing event name from docs.*.autoscaling.CreateLaunchConfiguration.complete-section to docs.*.auto-scaling.CreateLaunchConfiguration.complete-section
2024-08-30T02:12:30.017287306Z 2024-08-29 19:12:30,017 botocore.hooks [DEBUG] Changing event name from before-parameter-build.logs.CreateExportTask to before-parameter-build.cloudwatch-logs.CreateExportTask
2024-08-30T02:12:30.017670207Z 2024-08-29 19:12:30,017 botocore.hooks [DEBUG] Changing event name from docs.*.logs.CreateExportTask.complete-section to docs.*.cloudwatch-logs.CreateExportTask.complete-section
2024-08-30T02:12:30.017824608Z 2024-08-29 19:12:30,017 botocore.hooks [DEBUG] Changing event name from before-parameter-build.cloudsearchdomain.Search to before-parameter-build.cloudsearch-domain.Search
2024-08-30T02:12:30.018180709Z 2024-08-29 19:12:30,017 botocore.hooks [DEBUG] Changing event name from docs.*.cloudsearchdomain.Search.complete-section to docs.*.cloudsearch-domain.Search.complete-section
2024-08-30T02:12:30.043070893Z 2024-08-29 19:12:30,042 botocore.loaders [DEBUG] Loading JSON file: /usr/local/lib/python3.12/site-packages/boto3/data/s3/2006-03-01/resources-1.json
2024-08-30T02:12:30.046231604Z 2024-08-29 19:12:30,045 botocore.loaders [DEBUG] Loading JSON file: /usr/local/lib/python3.12/site-packages/botocore/data/endpoints.json
2024-08-30T02:12:30.092729061Z 2024-08-29 19:12:30,092 botocore.loaders [DEBUG] Loading JSON file: /usr/local/lib/python3.12/site-packages/botocore/data/sdk-default-configuration.json
2024-08-30T02:12:30.093298663Z 2024-08-29 19:12:30,092 botocore.hooks [DEBUG] Event choose-service-name: calling handler <function handle_service_name_alias at 0x7d18f0760ea0>
2024-08-30T02:12:30.122171061Z 2024-08-29 19:12:30,121 botocore.loaders [DEBUG] Loading JSON file: /usr/local/lib/python3.12/site-packages/botocore/data/s3/2006-03-01/service-2.json.gz
2024-08-30T02:12:30.163320100Z 2024-08-29 19:12:30,162 botocore.loaders [DEBUG] Loading JSON file: /usr/local/lib/python3.12/site-packages/botocore/data/s3/2006-03-01/endpoint-rule-set-1.json.gz
2024-08-30T02:12:30.170335624Z 2024-08-29 19:12:30,170 botocore.loaders [DEBUG] Loading JSON file: /usr/local/lib/python3.12/site-packages/botocore/data/partitions.json
2024-08-30T02:12:30.172870432Z 2024-08-29 19:12:30,172 botocore.hooks [DEBUG] Event creating-client-class.s3: calling handler <function add_generate_presigned_post at 0x7d18f06cdf80>
2024-08-30T02:12:30.172891932Z 2024-08-29 19:12:30,172 botocore.hooks [DEBUG] Event creating-client-class.s3: calling handler <function lazy_call.<locals>._handler at 0x7d18e0e4b920>
2024-08-30T02:12:30.172908832Z 2024-08-29 19:12:30,172 botocore.hooks [DEBUG] Event creating-client-class.s3: calling handler <function add_generate_presigned_url at 0x7d18f06cdd00>
2024-08-30T02:12:30.173741335Z 2024-08-29 19:12:30,173 botocore.configprovider [DEBUG] Looking for endpoint for s3 via: environment_service
2024-08-30T02:12:30.173832936Z 2024-08-29 19:12:30,173 botocore.configprovider [DEBUG] Looking for endpoint for s3 via: environment_global
2024-08-30T02:12:30.173851536Z 2024-08-29 19:12:30,173 botocore.configprovider [DEBUG] Looking for endpoint for s3 via: config_service
2024-08-30T02:12:30.174637838Z 2024-08-29 19:12:30,173 botocore.configprovider [DEBUG] Looking for endpoint for s3 via: config_global
2024-08-30T02:12:30.174653738Z 2024-08-29 19:12:30,174 botocore.configprovider [DEBUG] No configured endpoint found.
2024-08-30T02:12:30.177363748Z 2024-08-29 19:12:30,176 botocore.endpoint [DEBUG] Setting s3 timeout as (60, 60)
2024-08-30T02:12:30.185855076Z 2024-08-29 19:12:30,183 botocore.loaders [DEBUG] Loading JSON file: /usr/local/lib/python3.12/site-packages/botocore/data/_retry.json
2024-08-30T02:12:30.185888776Z 2024-08-29 19:12:30,184 botocore.client [DEBUG] Registering retry handlers for service: s3
2024-08-30T02:12:30.185895276Z 2024-08-29 19:12:30,184 botocore.utils [DEBUG] Registering S3 region redirector handler
2024-08-30T02:12:30.185899476Z 2024-08-29 19:12:30,185 botocore.utils [DEBUG] Registering S3Express Identity Resolver
2024-08-30T02:12:30.187022880Z 2024-08-29 19:12:30,186 boto3.resources.factory [DEBUG] Loading s3:s3
2024-08-30T02:12:30.189355288Z 2024-08-29 19:12:30,188 boto3.resources.factory [DEBUG] Loading s3:Bucket
2024-08-30T02:12:30.189548489Z 2024-08-29 19:12:30,189 boto3.resources.model [DEBUG] Renaming Bucket attribute name
2024-08-30T02:12:30.190693093Z 2024-08-29 19:12:30,190 botocore.hooks [DEBUG] Event creating-resource-class.s3.Bucket: calling handler <function lazy_call.<locals>._handler at 0x7d18e0e4bba0>
2024-08-30T02:12:30.191063994Z 2024-08-29 19:12:30,190 boto3.resources.factory [DEBUG] Loading s3:Object
2024-08-30T02:12:30.192887500Z 2024-08-29 19:12:30,192 botocore.hooks [DEBUG] Event creating-resource-class.s3.Object: calling handler <function lazy_call.<locals>._handler at 0x7d18e0e4bc40>
2024-08-30T02:12:30.193260201Z 2024-08-29 19:12:30,193 boto3.s3.transfer [DEBUG] Opting out of CRT Transfer Manager. Preferred client: auto, CRT available: False, Instance Optimized: False.
2024-08-30T02:12:30.193546702Z 2024-08-29 19:12:30,193 boto3.s3.transfer [DEBUG] Using default client. pid: 117, thread: 137545880443040
2024-08-30T02:12:30.194640606Z 2024-08-29 19:12:30,194 s3transfer.utils [DEBUG] Acquiring 0
2024-08-30T02:12:30.195298208Z 2024-08-29 19:12:30,195 s3transfer.tasks [DEBUG] UploadSubmissionTask(transfer_id=0, {'transfer_future': <s3transfer.futures.TransferFuture object at 0x7d18e0574ef0>}) about to wait for the following futures []
2024-08-30T02:12:30.195728810Z 2024-08-29 19:12:30,195 s3transfer.tasks [DEBUG] UploadSubmissionTask(transfer_id=0, {'transfer_future': <s3transfer.futures.TransferFuture object at 0x7d18e0574ef0>}) done waiting for dependent futures
2024-08-30T02:12:30.195950610Z 2024-08-29 19:12:30,195 s3transfer.tasks [DEBUG] Executing task UploadSubmissionTask(transfer_id=0, {'transfer_future': <s3transfer.futures.TransferFuture object at 0x7d18e0574ef0>}) with kwargs {'client': <botocore.client.S3 object at 0x7d18e052fd10>, 'config': <boto3.s3.transfer.TransferConfig object at 0x7d18e289bec0>, 'osutil': <s3transfer.utils.OSUtils object at 0x7d18e055a330>, 'request_executor': <s3transfer.futures.BoundedExecutor object at 0x7d18e0559e20>, 'transfer_future': <s3transfer.futures.TransferFuture object at 0x7d18e0574ef0>}
2024-08-30T02:12:30.196321712Z 2024-08-29 19:12:30,196 s3transfer.futures [DEBUG] Submitting task PutObjectTask(transfer_id=0, {'bucket': 'my-bucket', 'key': 'files/exports/8e4cdb9f6b48daeb0f9a0b3c40d7f78591ac8f92.xml', 'extra_args': {'ContentType': 'text/xml'}}) to executor <s3transfer.futures.BoundedExecutor object at 0x7d18e0559e20> for transfer request: 0.
2024-08-30T02:12:30.196471312Z 2024-08-29 19:12:30,196 s3transfer.utils [DEBUG] Acquiring 0

And this is the stack trace that occurs when the gunicorn worker shuts down:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/s3transfer/tasks.py", line 269, in _main
    self._submit(transfer_future=transfer_future, **kwargs)
  File "/usr/local/lib/python3.12/site-packages/s3transfer/upload.py", line 597, in _submit
    self._submit_upload_request(
  File "/usr/local/lib/python3.12/site-packages/s3transfer/upload.py", line 632, in _submit_upload_request
    self._transfer_coordinator.submit(
  File "/usr/local/lib/python3.12/site-packages/s3transfer/futures.py", line 323, in submit
    future = executor.submit(task, tag=tag)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/s3transfer/futures.py", line 474, in submit
    future = ExecutorFuture(self._executor.submit(task))
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 165, in submit
    with self._shutdown_lock, _global_shutdown_lock:
  File "/usr/local/lib/python3.12/site-packages/gunicorn/workers/base.py", line 204, in handle_abort
    sys.exit(1)
SystemExit: 1

Any suggestions would be greatly appreciated.


Solution

  • I finally discovered that it is not related to Azure, but to Docker and gunicorn.

    For some reason, using the gevent worker type in a Docker container causes S3 uploads to hang until the worker times out and exits.

    Switching to the gthread worker type solves the issue.

    Fails:

    gunicorn project.config.wsgi:application --bind 0.0.0.0:8000 --timeout 120 --preload --access-logfile '-' --error-logfile '-' --worker-class gevent --log-level 'debug'
    

    Succeeds:

    gunicorn project.config.wsgi:application --bind 0.0.0.0:8000 --timeout 120 --preload --access-logfile '-' --error-logfile '-' --worker-class gthread --log-level 'debug' --workers 8 --threads 8 --worker-connections 1000 --keep-alive 2
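
    The same gthread setup can also be expressed as a gunicorn.conf.py, since gunicorn config files are plain Python. This is just a sketch equivalent to the working command line above:

    # gunicorn.conf.py -- equivalent of the working command line above
    bind = "0.0.0.0:8000"
    timeout = 120
    preload_app = True
    accesslog = "-"
    errorlog = "-"
    loglevel = "debug"
    worker_class = "gthread"
    workers = 8
    threads = 8
    worker_connections = 1000
    keepalive = 2

    # Start with: gunicorn project.config.wsgi:application -c gunicorn.conf.py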