I tried to find the fastest way to work with large data files in Colab. I began to wonder whether it would be better to read them directly from the source site (e.g. Kaggle), or to download them into Colab's own directory and work with them from there. I managed to do the latter, but while the files were being unzipped the runtime suddenly stopped responding and crashed. I tried again, and this time I waited longer until everything was unzipped; however, the runtime crashed again at the next step.
Could you suggest the best way to work with (large) datasets without crashing the runtime?
The code I was using:
1)
First I created a Kaggle API token (kaggle.json) and copied it into Colab's working directory.
from google.colab import drive
drive.mount('/content/drive')  # mount Google Drive (optional; the files below live on the Colab VM, not on Drive)
! pip install kaggle
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/  # kaggle.json was placed in /content beforehand
! chmod 600 ~/.kaggle/kaggle.json
! kaggle competitions download forest-cover-type-prediction
After that, I tried to unzip the data files downloaded from Kaggle into a directory inside Colab:
! mkdir unzipped
! unzip train.csv.zip -d unzipped
! unzip test.csv.zip -d unzipped
and then read the data from the CSV files:
import numpy as np
import pandas as pd
train = pd.read_csv("/content/unzipped/train.csv")
test = pd.read_csv("/content/unzipped/test.csv")
X = train.to_numpy()[100000:5000000,0:4].astype(float)
Y = train.to_numpy()[100000:5000000,4].astype(int).flatten()
Question: how can I upload the data directly from the hard drive, and which method is faster?
Try getting an API token from the Account tab of your Kaggle profile. Then upload it to Google Colab and run the following code to initialize the Kaggle library:
! pip install kaggle
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
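If you prefer to pick kaggle.json from your hard drive programmatically instead of dragging it into Colab's file browser, here is a minimal sketch using Colab's files helper (this assumes it runs in a Colab notebook and that you select kaggle.json in the dialog), followed by the same mkdir/cp/chmod lines as above:
from google.colab import files
# Opens a file-picker dialog in the notebook and saves the chosen file
# (kaggle.json) into the current working directory, /content.
uploaded = files.upload()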
After the setup, use the syntax below to download a dataset:
! kaggle datasets download <name-of-dataset>
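Note that forest-cover-type-prediction is a competition rather than a dataset, so for that data the competitions command is the one to use (and the competition rules must be accepted on the Kaggle website first). Below is a minimal, memory-conscious sketch for reading the data after the download; the zip and column names are taken from the question and the Forest Cover Type competition, so adjust them as needed:
! kaggle competitions download -c forest-cover-type-prediction
import pandas as pd
# pandas can read a zipped CSV directly (compression is inferred from the
# .zip extension), so a separate unzip step is not required.
train = pd.read_csv("train.csv.zip")
# If RAM is tight, load only the columns you need, or process the file in
# chunks instead of all at once.
cols = ["Elevation", "Aspect", "Slope", "Cover_Type"]  # assumed column subset
for chunk in pd.read_csv("train.csv.zip", usecols=cols, chunksize=100_000):
    pass  # work on each chunk here (aggregate, filter, append to a list, ...)
Reading in chunks keeps peak memory use roughly proportional to the chunk size, which is usually enough to keep the Colab runtime from crashing.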