python, amazon-s3, boto3, kedro

Are S3 Kedro datasets thread-safe?


Kedro's S3 datasets, such as CSVS3DataSet and HDFS3DataSet, use boto3, which is documented as not thread-safe: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/resources.html?highlight=multithreading#multithreading-multiprocessing

Is it OK to use these datasets with the ParallelRunner?


Solution

  • Kedro uses s3fs, which in turn uses the boto3 library to access S3. Boto3 is indeed not thread-safe, but only when the same Session object is shared between threads.

    Each Kedro S3 dataset maintains its own instance of S3FileSystem, and therefore its own boto session, so using them in parallel is safe.

    This is probably not great for performance. If you work with hundreds of S3 datasets in parallel, or thousands of small S3 datasets sequentially, the pipeline might run for a long time and even fail with connection errors. With a few dozen of them, though, you are safe.
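
The "one session per thread" idea from the answer can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not Kedro's or s3fs's actual code: `StandInSession`, `get_session`, `worker`, and `run_workers` are hypothetical names, and `StandInSession` is only a placeholder for where a real `boto3.session.Session` would be created.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StandInSession:
    """Hypothetical placeholder for boto3.session.Session (not a boto3 class)."""


# Thread-local storage: each thread sees its own independent attributes.
_local = threading.local()


def get_session(factory=StandInSession):
    """Return the calling thread's own session, creating it on first use."""
    if not hasattr(_local, "session"):
        _local.session = factory()
    return _local.session


def worker(_task):
    # Each thread only ever touches the session it created itself,
    # so no session object is shared between threads.
    return threading.get_ident(), get_session()


def run_workers(n_tasks=8, n_threads=4):
    """Run n_tasks on a thread pool and return (thread id, session) pairs."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(worker, range(n_tasks)))
```

In real code, assuming boto3 is installed, you would pass `boto3.session.Session` as the `factory` argument. The point is the same as in the answer: the session is never shared, only created separately per thread (or, in Kedro's case, per dataset).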