I tried the following:
import pandas as pd

# write the first chunk with a header in 'w' mode, then append the rest
header = True
to_csv_mode = 'w'
with pd.read_csv(gs_path, chunksize=100000) as reader:
    for r in reader:
        r.to_csv(temp_gs_path, index=False, header=header, mode=to_csv_mode)
        header = False
        to_csv_mode = 'a'
But the file created in the GCS bucket is always overwritten instead of appended after the first write (to_csv_mode = 'a' is ignored), so in the end the file contains only the last chunk.
Google Cloud Storage is the Object Storage service in Google Cloud. An object is an immutable piece of data consisting of a file of any format.
As per the official documentation,
Objects are immutable, which means that an uploaded object cannot change throughout its storage lifetime. An object's storage lifetime is the time between successful object creation, such as uploading, and successful object deletion. In practice, this means that you cannot make incremental changes to objects, such as append operations or truncate operations. However, it is possible to replace objects that are stored in Cloud Storage, and doing so happens atomically: until the new upload completes, the old version of the object is served to readers, and after the upload completes the new version of the object is served to readers. So a single replacement operation simply marks the end of one immutable object's lifetime and the beginning of a new immutable object's lifetime.
This means that append is not a functionality Google Cloud Storage supports: if you write to the same object name, it always replaces the existing object.
To get the appending behavior you want, you can use Compose Objects as a workaround: write each chunk to its own temporary object, compose the temporary objects together into a single new object, and then delete the temporary objects.
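Below is a minimal sketch of that workaround using the google-cloud-storage client. The bucket name, object paths, and chunk size are assumptions for illustration; adapt them to your setup. Also note that a single compose call accepts at most 32 source objects, so a larger number of chunks would have to be composed in batches.

import pandas as pd
from google.cloud import storage

bucket_name = "my-bucket"            # hypothetical bucket
source_blob = "data/large.csv"       # hypothetical source object
target_blob = "data/large_copy.csv"  # hypothetical final object

client = storage.Client()
bucket = client.bucket(bucket_name)

# 1. Write each chunk to its own temporary object.
temp_names = []
with pd.read_csv(f"gs://{bucket_name}/{source_blob}", chunksize=100000) as reader:
    for i, chunk in enumerate(reader):
        temp_name = f"tmp/chunk_{i:05d}.csv"
        # Only the first chunk keeps the header row.
        chunk.to_csv(f"gs://{bucket_name}/{temp_name}", index=False, header=(i == 0))
        temp_names.append(temp_name)

# 2. Compose the temporary objects into the final object.
#    A single compose request is limited to 32 source objects, so with more
#    chunks you would compose in batches of 32.
bucket.blob(target_blob).compose([bucket.blob(n) for n in temp_names])

# 3. Delete the temporary objects.
for name in temp_names:
    bucket.blob(name).delete()

Writing the chunks to gs:// paths with pandas assumes gcsfs is installed, the same as in your original snippet; the compose and delete steps go through the google-cloud-storage library directly.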