I tried the following:
import pandas as pd

# write the first chunk with a header in 'w' mode, then append the rest
header = True
to_csv_mode = 'w'
with pd.read_csv(gs_path, chunksize=100000) as reader:
    for r in reader:
        r.to_csv(temp_gs_path, index=False, header=header, mode=to_csv_mode)
        header = False
        to_csv_mode = 'a'
But the file created in the GCS bucket is always overwritten instead of appended after the first write (to_csv_mode = 'a' is ignored), so in the end the file contains only the last chunk.
Google Cloud Storage is the Object Storage service in Google Cloud. An object is an immutable piece of data consisting of a file of any format.
As per the official documentation,
Objects are immutable, which means that an uploaded object cannot change throughout its storage lifetime. An object's storage lifetime is the time between successful object creation, such as uploading, and successful object deletion. In practice, this means that you cannot make incremental changes to objects, such as append operations or truncate operations. However, it is possible to replace objects that are stored in Cloud Storage, and doing so happens atomically: until the new upload completes, the old version of the object is served to readers, and after the upload completes the new version of the object is served to readers. So a single replacement operation simply marks the end of one immutable object's lifetime and the beginning of a new immutable object's lifetime.
This means that append is not a functionality Google Cloud Storage supports: if you write to the same object name, it always replaces the existing object.
To get the appending behavior you want, you can use Compose Objects as a workaround: write each chunk to its own temporary object, compose the temporary objects together into a single new object, and then delete the temporary objects.
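Below is a minimal sketch of that workaround using the google-cloud-storage client. The bucket name, object paths, and chunk size are assumptions for illustration; adapt them to your setup. Also note that a single compose call accepts at most 32 source objects, so a larger number of chunks would have to be composed in batches.

import pandas as pd
from google.cloud import storage

bucket_name = "my-bucket"            # hypothetical bucket
source_blob = "data/large.csv"       # hypothetical source object
target_blob = "data/large_copy.csv"  # hypothetical final object

client = storage.Client()
bucket = client.bucket(bucket_name)

# 1. Write each chunk to its own temporary object.
temp_names = []
with pd.read_csv(f"gs://{bucket_name}/{source_blob}", chunksize=100000) as reader:
    for i, chunk in enumerate(reader):
        temp_name = f"tmp/chunk_{i:05d}.csv"
        # Only the first chunk keeps the header row.
        chunk.to_csv(f"gs://{bucket_name}/{temp_name}", index=False, header=(i == 0))
        temp_names.append(temp_name)

# 2. Compose the temporary objects into the final object.
#    A single compose request is limited to 32 source objects, so with more
#    chunks you would compose in batches of 32.
bucket.blob(target_blob).compose([bucket.blob(n) for n in temp_names])

# 3. Delete the temporary objects.
for name in temp_names:
    bucket.blob(name).delete()

Writing the chunks to gs:// paths with pandas assumes gcsfs is installed, the same as in your original snippet; the compose and delete steps go through the google-cloud-storage library directly.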