I am running a Spark application on an Amazon EMR cluster, and for the past few days I have been getting the following error whenever I try to read a file from S3 using pandas. I have added bootstrap actions to install pandas, fsspec, and s3fs.
Code:
import pandas as pd
df = pd.read_csv(s3_path)
Error Log:
Traceback (most recent call last):
File "spark.py", line 84, in <module>
df=pd.read_csv('s3://<bucketname>/<filename>.csv')
File "/usr/local/lib64/python3.7/site-packages/pandas/io/parsers.py", line 686, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib64/python3.7/site-packages/pandas/io/parsers.py", line 435, in _read
filepath_or_buffer, encoding, compression
File "/usr/local/lib64/python3.7/site-packages/pandas/io/common.py", line 222, in get_filepath_or_buffer
filepath_or_buffer, mode=mode or "rb", **(storage_options or {})
File "/usr/local/lib/python3.7/site-packages/fsspec/core.py", line 133, in open
out = self.__enter__()
File "/usr/local/lib/python3.7/site-packages/fsspec/core.py", line 101, in __enter__
f = self.fs.open(self.path, mode=mode)
File "/usr/local/lib/python3.7/site-packages/fsspec/spec.py", line 844, in open
**kwargs
File "/usr/local/lib/python3.7/site-packages/s3fs/core.py", line 394, in _open
autocommit=autocommit, requester_pays=requester_pays)
File "/usr/local/lib/python3.7/site-packages/s3fs/core.py", line 1276, in __init__
cache_type=cache_type)
File "/usr/local/lib/python3.7/site-packages/fsspec/spec.py", line 1134, in __init__
self.details = fs.info(path)
File "/usr/local/lib/python3.7/site-packages/s3fs/core.py", line 719, in info
return sync(self.loop, self._info, path, bucket, key, kwargs, version_id)
File "/usr/local/lib/python3.7/site-packages/fsspec/asyn.py", line 51, in sync
raise exc.with_traceback(tb)
File "/usr/local/lib/python3.7/site-packages/fsspec/asyn.py", line 35, in f
result[0] = await future
File "/usr/local/lib/python3.7/site-packages/s3fs/core.py", line 660, in _info
Key=key, **version_id_kw(version_id), **self.req_kw)
File "/usr/local/lib/python3.7/site-packages/s3fs/core.py", line 214, in _call_s3
raise translate_boto_error(err)
File "/usr/local/lib/python3.7/site-packages/s3fs/core.py", line 207, in _call_s3
return await method(**additional_kwargs)
File "/usr/local/lib/python3.7/site-packages/aiobotocore/client.py", line 121, in _make_api_call
operation_model, request_dict, request_context)
File "/usr/local/lib/python3.7/site-packages/aiobotocore/client.py", line 140, in _make_request
return await self._endpoint.make_request(operation_model, request_dict)
File "/usr/local/lib/python3.7/site-packages/aiobotocore/endpoint.py", line 90, in _send_request
exception):
File "/usr/local/lib/python3.7/site-packages/aiobotocore/endpoint.py", line 199, in _needs_retry
caught_exception=caught_exception, request_dict=request_dict)
File "/usr/local/lib/python3.7/site-packages/aiobotocore/hooks.py", line 29, in _emit
response = handler(**kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/utils.py", line 1225, in redirect_from_error
new_region = self.get_bucket_region(bucket, response)
File "/usr/local/lib/python3.7/site-packages/botocore/utils.py", line 1283, in get_bucket_region
headers = response['ResponseMetadata']['HTTPHeaders']
TypeError: 'coroutine' object is not subscriptable
sys:1: RuntimeWarning: coroutine 'AioBaseClient._make_api_call' was never awaited
Could the issue be with s3fs? It and pandas seem to be the only packages that received updates recently, but I couldn't find anything related to this in the pandas changelog.
The Dask/s3fs team has acknowledged this to be a bug. This GitHub issue suggests that aiobotocore is unable to determine the region_name for the S3 bucket.
If you are facing the same issue, either downgrade s3fs to 0.4.2, or set the environment variable AWS_DEFAULT_REGION as a workaround.
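The environment-variable workaround can also be applied from Python itself, as long as it runs before s3fs/aiobotocore make any S3 calls. A minimal sketch; "us-east-1" is a placeholder for whatever region your bucket actually lives in:

```python
import os

# Set the region up front so the failing bucket-region auto-detection
# inside aiobotocore is never triggered.
# NOTE: "us-east-1" is a placeholder -- use your bucket's real region.
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

# The original read then runs unchanged:
# import pandas as pd
# df = pd.read_csv("s3://<bucketname>/<filename>.csv")
```

The variable has to be set before the first s3fs filesystem object is created; setting it after pandas has already opened an S3 connection will have no effect.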
Edit: This has been fixed in aiobotocore 1.1.1. Upgrade your aiobotocore and s3fs if you are facing the same issue.
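After upgrading (e.g. `pip install --upgrade aiobotocore s3fs`), you can sanity-check at runtime that the installed release includes the fix. A small sketch; `is_fixed` is a hypothetical helper that only handles plain dotted versions, and on Python 3.7 (as in the traceback above) you would need the `importlib_metadata` backport instead of the stdlib module:

```python
from importlib.metadata import version  # stdlib on Python 3.8+

def is_fixed(pkg_version: str, minimum: str = "1.1.1") -> bool:
    """Return True if a dotted version string is at least `minimum`."""
    have = [int(p) for p in pkg_version.split(".")[:3]]
    need = [int(p) for p in minimum.split(".")]
    return have >= need

# Uncomment in an environment where aiobotocore is installed:
# print(is_fixed(version("aiobotocore")))
```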