Tags: pandas, google-cloud-platform, google-cloud-storage, dask, parquet

Why does my notebook kernel get killed when reading a partitioned parquet file from Google Storage, but not locally?


I am running a Notebook instance from the AI Platform on an E2 high-memory VM with 4 vCPUs and 32 GB of RAM.

I need to read a partitioned parquet file of about 1.8 GB from Google Storage using pandas.

It needs to be completely loaded in RAM, so I can't rely on Dask's lazy computation for it. Nonetheless, I tried loading it through Dask as well and ran into the same problem.
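
For reference, the Dask attempt looked roughly like this (a minimal sketch; the gs:// path is the same placeholder used below):

import dask.dataframe as dd

# Open the partitioned parquet dataset lazily, then materialize it fully in RAM
ddf = dd.read_parquet("gs://path_to_file/file.parquet", engine="pyarrow")
df = ddf.compute()  # dies the same way as pd.read_parquet on the gs:// path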

When I download the file locally to the VM, I can read it with pd.read_parquet. RAM consumption goes up to about 13 GB and then drops to about 6 GB once the file is loaded. It works.

import pandas as pd

df = pd.read_parquet("../data/file.parquet",
                     engine="pyarrow")

When I try to load it directly from Google Storage, RAM goes up to about 13 GB and then the kernel dies. No logs, warnings, or errors are raised.

df = pd.read_parquet("gs://path_to_file/file.parquet",
                    engine="pyarrow")
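
As a workaround (not an answer to the underlying question), since reading the downloaded copy works, the download step can be scripted from the notebook with gcsfs, the filesystem pandas itself uses for gs:// paths; the paths below are the same placeholders as above:

import gcsfs
import pandas as pd

fs = gcsfs.GCSFileSystem()
# Copy the partitioned parquet directory from GCS to the VM's local disk...
fs.get("gs://path_to_file/file.parquet", "../data/file.parquet", recursive=True)
# ...and read it locally, which is the path that works
df = pd.read_parquet("../data/file.parquet", engine="pyarrow")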

Some info on the package versions:

Python 3.7.8
pandas==1.1.1
pyarrow==1.0.1
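
(These can be confirmed from inside the notebook:)

import sys
import pandas as pd
import pyarrow

print(sys.version)
print("pandas", pd.__version__)
print("pyarrow", pyarrow.__version__)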

What could be causing it?


Solution

  • The problem was caused by a deprecated image version on the VM.

    According to GCP support, you can check whether the image is deprecated as follows (a scripted version of the check is sketched after the steps):

    1. Go to GCE and click on “VM instances”.
    2. Click on the VM instance in question.
    3. Look for the section “Boot disk” and click on the Image link.
    4. If the image has been deprecated, there will be a field indicating it.

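    The same check can also be scripted from inside the notebook. A rough sketch using the gcloud CLI through subprocess, where INSTANCE_NAME and ZONE are placeholders for the VM in question:

        import subprocess

        def gcloud(*args):
            """Run a gcloud command and return its stdout as text."""
            result = subprocess.run(["gcloud", *args], capture_output=True, text=True, check=True)
            return result.stdout.strip()

        # Boot disk of the VM (assumed to be the first disk), then the image it was built from
        disk_url = gcloud("compute", "instances", "describe", "INSTANCE_NAME",
                          "--zone", "ZONE", "--format", "value(disks[0].source)")
        image_url = gcloud("compute", "disks", "describe", disk_url.rsplit("/", 1)[-1],
                           "--zone", "ZONE", "--format", "value(sourceImage)")

        # A deprecated image reports a non-empty deprecated.state (e.g. DEPRECATED)
        state = gcloud("compute", "images", "describe", image_url,
                       "--format", "value(deprecated.state)")
        print("deprecation state:", state or "not deprecated")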

    The solution is to create a new Notebook instance and export/import whatever you want to keep. That way the new VM will have an updated image, which hopefully includes a fix for the problem.