I'm currently working with a 6 GB CSV file, trying to extract some insights from the data on Google Cloud Platform. I usually do this kind of work with Cloud Datalab, because I find it a good tool for visualizing data. The problem comes when I try to load all the information into a dataframe. Since I'm running Datalab on a VM, I assume the performance depends on the power of that VM. Currently, I get a timeout every time I try to load the records into the dataframe (even with a VM with 4 CPUs and 15 GB of RAM).

Is there a standard way to clean and visualize large datasets in GCP (using dataframes if possible)? Maybe I'm just choosing the wrong option.
Any help would be much appreciated.
As an update, I found a way to load the CSV file into a dataframe using a different library instead of pandas, called [Dask](https://dask.pydata.org/en/latest). With it I was able to run some basic operations quite quickly. Even so, I think the practical solution for working with very large files is to work on a sample of the data that is representative enough.
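For reference, here is a minimal sketch of the kind of thing I did with Dask (the file name and sample fraction are just placeholders):

```python
import dask.dataframe as dd

# Lazily read the large CSV; Dask splits it into partitions
# instead of loading everything into memory at once.
df = dd.read_csv('my_large_file.csv')

# Operations are lazy until .compute() is called.
print(df.describe().compute())

# Pull a random sample (e.g. 1%) into a regular pandas DataFrame
# for cleaning and visualization in Datalab.
sample = df.sample(frac=0.01).compute()
```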