I am running a Notebook instance from AI Platform on an E2 high-memory VM with 4 vCPUs and 32 GB of RAM.
I need to read a partitioned parquet file of about 1.8 GB from Google Storage using pandas.
It needs to be loaded completely into RAM, so using Dask compute is not an option for me. I tried that route anyway, and it ran into the same problem.
When I download the file to the VM's local disk, I can read it with pd.read_parquet.
RAM consumption goes up to about 13 GB and then drops to 6 GB once the file is loaded. It works.
df = pd.read_parquet("../data/file.parquet",
engine="pyarrow")
When I try to load it directly from Google Storage, RAM usage goes up to about 13 GB and then the kernel dies. No logs, warnings, or errors are raised.
df = pd.read_parquet("gs://path_to_file/file.parquet",
engine="pyarrow")
Some info on the package versions:
Python 3.7.8
pandas==1.1.1
pyarrow==1.0.1
What could be causing it?
The problem was caused by a deprecated image version on the VM.
According to GCP's support, you can check whether the instance's image version is deprecated.
The solution is to create a new Notebook instance and export/import whatever you want to keep. That way the new VM will run an updated image, which hopefully includes a fix for the problem.
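Not part of the fix itself, but since the question notes that reading the same data from the VM's local disk works, an interim workaround is to copy the parquet data down from GCS first and read it locally. Below is a minimal sketch assuming gcsfs is installed and default credentials are available; the bucket path and local directory are placeholders taken from the question.

import gcsfs
import pandas as pd

# Uses the notebook's default Google credentials
fs = gcsfs.GCSFileSystem()

# Recursively copy the (possibly partitioned) parquet data to local disk
fs.get("gs://path_to_file/file.parquet", "../data/file.parquet", recursive=True)

# Reading from local disk worked reliably in the question's setup
df = pd.read_parquet("../data/file.parquet", engine="pyarrow")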