I have a Glue Job that is started from a Lambda that is triggered by an S3 put event. This Glue Job has some proprietary dependencies that, when used locally, are normally available via a private on-site host.
I believe this is an important point to mention: One of the wheels is a dependency of the other.
I've read the docs on the --extra-py-files and --additional-python-modules flags. I've placed the related wheel files in an S3 bucket and used the --extra-py-files flag to specify both of their locations in the order that would make sense, e.g.:
"--extra-py-files": 's3://<bucket>/path/to/<other-wheels-dependency>.whl, s3://<bucket>/path/to<wheel-with-the-dependency>'
The first module installs fine; I can see in the Job's logs that it was successfully installed. But when the second wheel's install is attempted, pip appears unable to locate the first wheel as its dependency.
Python is not my go-to language, and this is my first attempt at a Glue Job. But I'm wondering: why isn't it able to find the first package after its install successfully completes?
There's a warning about the current user not having permissions to write to pip's cache that I feel could be related, but I'm not sure.
Related warning:
WARNING: The directory '/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The install errors are what you would expect:
ERROR: Could not find a version that satisfies the requirement <first wheel>>=1.0.5 (from <second-wheel>==1.0.5) (from versions: none)
ERROR: No matching distribution found for <first-wheel>>=1.0.5 (from <second-wheel>==1.0.5)
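One way to sanity-check this error is to read the metadata that each wheel actually declares, since pip matches the requirement against the Name and Version recorded inside the first wheel, not against its file name or S3 path. The following is a minimal sketch using only the standard library; the function name wheel_requirements is made up for illustration, and it assumes the wheels have been downloaded locally first.

```python
import zipfile
from email.parser import Parser


def wheel_requirements(wheel_path):
    """Return (name, version, requires_dist) read from a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_path) as wheel:
        # A wheel is a zip archive containing <name>-<version>.dist-info/METADATA.
        metadata_name = next(
            name for name in wheel.namelist()
            if name.endswith(".dist-info/METADATA")
        )
        raw = wheel.read(metadata_name).decode("utf-8")
    # METADATA uses an email-header style layout, so the stdlib parser can read it.
    metadata = Parser().parsestr(raw)
    requires = metadata.get_all("Requires-Dist") or []
    return metadata["Name"], metadata["Version"], requires
```

Calling it on both wheels shows whether the Requires-Dist entry of the second wheel lines up with the Name and Version of the first. If they do line up, the "from versions: none" part of the error suggests pip simply was not looking anywhere that contains the first wheel when it resolved the second.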
I'm not really sure what my options are to make this work. Is there some option I could add to the pip install command so that I could specify only the second package in --extra-py-files and provide it the S3 location of its dependency? Alternatively, could I package both in a single wheel?
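On the question of passing options to pip: as far as I understand the Glue documentation, --additional-python-modules accepts a comma-separated list of modules or S3 wheel paths, and --python-modules-installer-option forwards extra options to pip3 for that install. Below is a rough sketch of building those job arguments in the Lambda. The helper name build_glue_arguments, the bucket and the wheel file names are placeholders, and whether this resolves the dependency between the two wheels is an untested assumption.

```python
def build_glue_arguments(wheel_uris, pip_options=None):
    """Build Glue job arguments that install wheels via --additional-python-modules."""
    # The value is a comma-separated list, so drop stray whitespace around each URI.
    cleaned = [uri.strip() for uri in wheel_uris if uri.strip()]
    arguments = {"--additional-python-modules": ",".join(cleaned)}
    if pip_options:
        # Optional extra options that Glue hands to pip3 (assumed per the Glue docs).
        arguments["--python-modules-installer-option"] = pip_options
    return arguments


# Placeholder locations; list the dependency first, then the wheel that needs it.
arguments = build_glue_arguments([
    "s3://example-bucket/wheels/first_wheel-1.0.5-py3-none-any.whl",
    " s3://example-bucket/wheels/second_wheel-1.0.5-py3-none-any.whl",
])

# In the Lambda handler this would be passed along the lines of:
#   boto3.client("glue").start_job_run(JobName="my-job", Arguments=arguments)
```

The whitespace stripping is deliberate: the example value above has a space after the comma, and it seems safer not to rely on Glue trimming it.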
It is hard to reproduce your bug. I would say packaging everything in a single wheel is the simplest thing to do.
Keep in mind a few things (or rather gotchas):
Also, Lambda now supports Python 3.9, container images, and runtimes of up to 15 minutes with 10 GB of RAM. A lot of jobs can be done there, avoiding Glue, which I find to be quite a badly documented and expensive service in AWS.
Did you take a look at this link for Glue jobs with dependencies? I found it quite helpful.