python pandas mocking pytest moto

Using moto with pandas read_parquet and to_parquet functions


I am trying to write a unit test for a function that uses pd.read_parquet(), and I am struggling to make it work. I have the code below:

from moto import mock_aws
import pandas as pd
import pytest
import datetime as dt
import boto3
from my_module import foo

@pytest.fixture
def mock_df():
    cols = [
        "timestamp",
        "value"
    ]
    values = [
        [dt.datetime(2024, 1, 1, 0), 2.57],
        [dt.datetime(2024, 1, 1, 1), 1.41],
        [dt.datetime(2024, 1, 1, 2), 2.06],
    ]
    df = pd.DataFrame(values, columns=cols)
    return df


@mock_aws
def test_download(mock_df):
    bucket_name = "test-input-bucket"
    s3 = boto3.resource("s3", region_name="us-east-1")
    s3.create_bucket(Bucket=bucket_name)
    key1 = "s3://test-input-bucket/path/to/data.parquet"
    mock_df.to_parquet(key1) # code fails already here
    foo() # uses pd.read_parquet()

But I am getting this error:

OSError: When initiating multiple part upload for key 'path/to/data.parquet'
in bucket 'test-input-bucket': AWS Error INVALID_ACCESS_KEY_ID during 
CreateMultipartUpload operation: The AWS Access Key Id you provided does not exist in our records.

I get the same error whether I use to_parquet or read_parquet. Everything works fine if I use something different for the upload and download, like

s3_bucket.put_object(Key=key1, Body=mock_df.to_parquet())

However, replacing the pandas functions is not possible in my situation, so I need a way to mock S3 while still using them. Is there a way to make moto work with these functions?

EDIT: I am using these versions

boto3                                    1.28.64
botocore                                 1.31.64
moto                                     5.0.3

Solution

  • This fixed the issue on my end. I am not 100% sure it applies to every single case.

    On our end we did indeed narrow the issue down to using to_parquet or read_parquet in tests. Using fastparquet as the engine (engine='fastparquet') seemed to provide a solution, but that wasn't always possible for us.

    For some reason, pyarrow wants credentials, whereas other packages don't care. Setting fake credentials and making sure they are in place before the connection is created did the trick for us. So something like

    
    import os

    import boto3
    import pytest
    from moto import mock_aws


    @pytest.fixture
    def aws_credentials():
        """Mocked AWS Credentials for moto."""
        os.environ["AWS_ACCESS_KEY_ID"] = "testing"
        os.environ["AWS_SECRET_ACCESS_KEY"] = "testing"
        os.environ["AWS_SECURITY_TOKEN"] = "testing"
        os.environ["AWS_SESSION_TOKEN"] = "testing"


    @pytest.fixture
    def s3(aws_credentials):
        # Start the mock only after the fake credentials are set,
        # and keep it active for the duration of the test.
        with mock_aws():
            yield boto3.client("s3", region_name="us-east-1")


    def test_file(s3, mock_df):
        # mock_df is the DataFrame fixture from the question.
        s3.create_bucket(Bucket="testbucket")
    

    allowed us to access the bucket. Let me know if it works on your end as well.
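One caveat with assigning to os.environ directly is that the fake values stay set for the rest of the test session. A minimal sketch of scoping them instead, using only the standard library, is shown here. The helper name fake_aws_credentials and the FAKE_AWS_ENV constant are hypothetical names chosen for illustration, not part of moto or pytest; the values are the same placeholders as above.

```python
import os
from contextlib import contextmanager
from unittest import mock

# Placeholder values, same as in the fixture above (hypothetical constant name).
FAKE_AWS_ENV = {
    "AWS_ACCESS_KEY_ID": "testing",
    "AWS_SECRET_ACCESS_KEY": "testing",
    "AWS_SECURITY_TOKEN": "testing",
    "AWS_SESSION_TOKEN": "testing",
}


@contextmanager
def fake_aws_credentials(extra=None):
    """Temporarily export placeholder AWS credentials.

    mock.patch.dict restores os.environ to its previous state on exit,
    so variables that were set (or unset) before the block are unchanged.
    """
    env = {**FAKE_AWS_ENV, **(extra or {})}
    with mock.patch.dict(os.environ, env):
        yield env
```

Inside a fixture this would wrap the mock, for example `with fake_aws_credentials(), mock_aws(): yield boto3.client(...)`. pytest's built-in monkeypatch fixture (monkeypatch.setenv) should achieve the same cleanup if you prefer to stay within pytest.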