Tags: python, pandas, amazon-s3, parquet, python-s3fs

Read Parquet files with Pandas from S3 bucket directory with Proxy


I would like to read an S3 directory containing multiple Parquet files with the same schema. The code works without the proxy, but as soon as the proxy is enabled I get the following error:

Traceback (most recent call last):
  File "script.py", line 158, in <module>
    df = pq.read_table(source=bucket_path, filesystem=s3).to_pandas()
  File "pyarrow\parquet\__init__.py", line 2737, in read_table
    dataset = _ParquetDatasetV2(
  File "\pyarrow\parquet\__init__.py", line 2351, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
  File "pyarrow\dataset.py", line 694, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "pyarrow\dataset.py", line 447, in _filesystem_dataset
    factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
  File "pyarrow\_dataset.pyx", line 2031, in pyarrow._dataset.FileSystemDatasetFactory.__init__
  File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 's3://test/files/part-00000-ed788628-0a6d-4ce9-b604-dd4c6ec75b6d-c000.snappy.parquet', which is outside base dir 's3://test/files/'

Here is the code. I commented out the other approach I tried:

import os

import pyarrow.parquet as pq
import s3fs

bucket_path = 's3://test/files/'

# Route HTTPS traffic through the proxy via the environment
os.environ['https_proxy'] = 'http://proxy.com:4200'

# Alternative attempt: pass the proxies to s3fs explicitly
# proxies = {
#        'https': 'http://proxy.com:4200',
#        'http': 'http://proxy.com:4200'
#    }

# s3 = s3fs.S3FileSystem(anon=False, config_kwargs={'proxies': proxies})

s3 = s3fs.S3FileSystem(anon=False)

df = pq.read_table(source=bucket_path, filesystem=s3).to_pandas()

I couldn't find anyone with the same problem. Any help is welcome.

Thank you in advance.


Solution

  • Replace:

    bucket_path = 's3://test/files/'
    

    with:

    bucket_path = '/test/files/'
    

    pyarrow passes the path on to the filesystem instance you supplied, and according to the fsspec documentation a path given to a filesystem instance should come without a scheme.
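
As a rough sketch of the fix above, the scheme can be stripped in code instead of editing the path by hand. The names `strip_scheme`, `read_parquet_dir` and `proxy_url` are made up for this illustration and do not come from pyarrow or s3fs. The sketch assumes that s3fs accepts the `bucket/key` form without a leading slash and that `config_kwargs={'proxies': ...}` is forwarded to botocore, as in the commented-out attempt in the question. The pyarrow and s3fs imports are kept inside the function so that the helper can be tried without them installed.

```python
from typing import Optional


def strip_scheme(path: str) -> str:
    """Return the path without a leading '<scheme>://' prefix, if present."""
    _scheme, sep, rest = path.partition("://")
    # No '://' found means the path already has no scheme: return it unchanged.
    return rest if sep else path


def read_parquet_dir(bucket_path: str, proxy_url: Optional[str] = None):
    """Read every Parquet file under bucket_path into one pandas DataFrame.

    Hypothetical helper: proxy_url is optional and, when given, is passed to
    s3fs through config_kwargs (assumed to reach botocore's Config).
    """
    # Imported here so strip_scheme can be used without these packages.
    import pyarrow.parquet as pq
    import s3fs

    config_kwargs = {}
    if proxy_url:
        config_kwargs["proxies"] = {"http": proxy_url, "https": proxy_url}

    s3 = s3fs.S3FileSystem(anon=False, config_kwargs=config_kwargs)

    # Hand pyarrow a scheme-less path, since an explicit filesystem is given.
    table = pq.read_table(strip_scheme(bucket_path), filesystem=s3)
    return table.to_pandas()
```

With the values from the question, the call would look like `df = read_parquet_dir('s3://test/files/', proxy_url='http://proxy.com:4200')`. Setting `https_proxy` in the environment, as in the original script, should work in place of `proxy_url` as well.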