Tags: python, python-3.x, aws-glue, egg, pyarrow

Use pyarrow in Glue pythonshell - ModuleNotFoundError: No module named 'pyarrow.lib'


I created an egg and a whl file of pyarrow and uploaded them to S3 so that a Python shell job could use them. The job fails with the error below.

Job code:

import pyarrow
raise

Error (the whl produces a traceback with the same structure):

Traceback (most recent call last):
  File "/tmp/runscript.py", line 118, in <module>
    runpy.run_path(temp_file_path, run_name='__main__')
  File "/usr/local/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/local/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/glue-python-scripts-e67xuz2j/genos.py", line 1, in <module>
  File "/glue/lib/installation/kanna-0.1-py3.6.egg/pyarrow/__init__.py", line 49, in <module>
    from pyarrow.lib import cpu_count, set_cpu_count
ModuleNotFoundError: No module named 'pyarrow.lib'

P.S.: I cannot find a lib.py file or a lib folder in my local files.


Solution

  • I was having the same problem with AWS Lambda and came across this question.

    For Glue, the AWS docs state that only pure Python libraries can be used.

    For Lambda:

    The underlying problem is that modules like pyarrow wrap code written in C/C++. If you check the pyarrow codebase, you will find that two pyarrow.lib files do exist, but they have .pyx and .pxd extensions, which makes them Cython sources. This is not pure Python code: it has to be compiled into a binary extension module, and that binary depends on the underlying CPU architecture and on the Python version it was built for.

    I had to manually download the .whl files for my required versions of pyarrow and its dependency NumPy. From http://pypi.org/project/pyarrow/, click on Download files and look for the file that matches your runtime: cp39 means CPython 3.9, and x86_64 is the CPU architecture. Follow the same steps for NumPy. I ended up downloading these files: pyarrow-8.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl and numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
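To make the filename convention concrete, here is a minimal sketch that splits a wheel filename into its tags and checks them against a target runtime. It assumes the standard wheel naming layout (distribution, version, optional build tag, Python tag, ABI tag, platform tag). The names `WheelTags`, `parse_wheel_filename`, and `matches_runtime` are hypothetical helpers, and the architecture check is a simple suffix comparison rather than a full compatibility check.

```python
from typing import NamedTuple, Optional


class WheelTags(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename):
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        distribution, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        distribution, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    return WheelTags(distribution, version, build, python_tag, abi_tag, platform_tag)


def matches_runtime(tags, python_tag, arch):
    """Rough check: the Python tag matches and some platform tag ends with arch.

    Tags can be dot-separated sets (e.g. "py2.py3"), so each set is split first.
    """
    python_ok = python_tag in tags.python_tag.split(".")
    arch_ok = any(p.endswith(arch) for p in tags.platform_tag.split("."))
    return python_ok and arch_ok


pyarrow_tags = parse_wheel_filename(
    "pyarrow-8.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
print(pyarrow_tags.python_tag, pyarrow_tags.platform_tag)
print(matches_runtime(pyarrow_tags, python_tag="cp39", arch="x86_64"))
```

For the two files named above, the check passes for a CPython 3.9 runtime on x86_64 and would fail for, say, `arch="aarch64"`.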

    You then have to unzip them and create an archive where both sit together in a folder named python (lowercase, since that is the folder name Lambda looks for in a layer). This archive can be used to create a layer in Lambda. Attach this layer to your function and import pyarrow should work.
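The unzip-and-repack step can be sketched in Python as below. This is an illustration, not the exact procedure used in this answer: `build_layer_zip` is a hypothetical helper, and the demo uses tiny stand-in wheels with made-up names instead of the real downloads. It places every file under a lowercase `python/` prefix, which is where Lambda's Python runtimes look for layer content, and sets read permissions explicitly because `writestr` otherwise defaults to owner-only access.

```python
import tempfile
import zipfile
from pathlib import Path


def build_layer_zip(wheel_paths, output_zip):
    """Merge the contents of several wheels under python/ in one zip archive."""
    seen = set()
    with zipfile.ZipFile(output_zip, "w") as layer:
        for wheel_path in wheel_paths:
            with zipfile.ZipFile(wheel_path) as wheel:
                for info in wheel.infolist():
                    if info.is_dir():
                        continue
                    target = "python/" + info.filename
                    if target in seen:
                        continue  # keep the first copy of a duplicated path
                    seen.add(target)
                    member = zipfile.ZipInfo(target, date_time=info.date_time)
                    member.compress_type = zipfile.ZIP_DEFLATED
                    # Assumption: world-readable files are enough for a layer.
                    member.external_attr = 0o644 << 16
                    layer.writestr(member, wheel.read(info))
    return Path(output_zip)


# Demo with stand-in wheels; real use would pass the downloaded .whl paths.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    wheels = []
    for package in ("pyarrow", "numpy"):
        wheel = tmp_path / f"{package}-0.0.0-py3-none-any.whl"
        with zipfile.ZipFile(wheel, "w") as zf:
            zf.writestr(f"{package}/__init__.py", "")
        wheels.append(wheel)
    layer_path = build_layer_zip(wheels, tmp_path / "layer.zip")
    with zipfile.ZipFile(layer_path) as layer:
        layer_members = sorted(layer.namelist())

print(layer_members)
```

The resulting zip is what you would upload when creating the layer. With the real pyarrow and NumPy wheels the archive is large, so it is worth checking its size against the Lambda layer limits before uploading.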

    The other solution is to use custom Docker images. This worked for me as well. I believe the AWS docs are exhaustive on that topic. I have written up a PoC and all the steps I followed here.

    I followed this guide for creating a pyarrow layer.