I am working with a large dataset on a machine with little memory, and I was introduced to Dask DataFrames. From the docs I understood that Dask does not load the whole dataset into memory; instead it uses multiple threads that fetch records from disk on demand. So I assumed that a Keras model with batch_size = 500 would only keep 500 records in memory during training, but when I start training it takes forever. Maybe I am doing something wrong; please suggest.
Shape of the training data: 1000000 x 1290
import glob

import dask.dataframe as dd
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

batch_size = 500
num_classes = 2
epochs = 5

# Lazily point Dask at the training CSVs; nothing is read into memory yet.
paths_train = glob.glob(r'x_train_d_final*.csv')
X_train_d = dd.read_csv('.../x_train_d_final0.csv')

# One-hot encode the labels.
Y_train1 = keras.utils.to_categorical(Y_train.iloc[:, 1], num_classes)

model = Sequential()
model.add(Dense(645, activation='sigmoid', input_shape=(1290,),
                kernel_initializer='glorot_normal'))
# model.add(Dense(20, activation='sigmoid', kernel_initializer='glorot_normal'))
model.add(Dense(num_classes, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer=Adam(decay=0),
              metrics=['accuracy'])

history = model.fit(X_train_d.to_records(), Y_train1,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    class_weight={0: 1, 1: 6.5},
                    shuffle=False)
Today Keras does not know about Dask DataFrames or arrays. I suspect that it is just converting the Dask object into the equivalent pandas or NumPy object instead, which means the entire dataset gets materialized in memory before training even starts.
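You can see the effect with a small sketch like the one below (this is an illustration of the suspected behaviour, not code from your post, and the filename pattern is just an assumption); converting the lazy Dask collection to a concrete NumPy array pulls everything into RAM at once:

import numpy as np
import dask.dataframe as dd

X_train_d = dd.read_csv('x_train_d_final*.csv')  # lazy: nothing is read yet
records = X_train_d.to_records()                 # still lazy (a Dask array)
X_full = np.asarray(records)                     # triggers compute(): the whole
                                                 # dataset lands in memory, which is
                                                 # roughly what the fit() call ends
                                                 # up doing, as far as I can tell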
If your Keras model can be trained incrementally, then you could work around this with dask.delayed and a few for loops, along the lines of the sketch below.
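A minimal sketch of that idea, reusing the model, batch_size, epochs, num_classes, class weights, and paths_train defined above; the 'label' column name is an assumption about your CSV layout, not something from your post. Each shard is loaded on demand and only that shard is held in memory while the model is updated:

import pandas as pd
import keras
from dask import delayed

@delayed
def load_shard(path):
    # Read a single CSV shard into memory and split it into features/labels.
    df = pd.read_csv(path)
    X = df.drop(columns=['label']).values
    y = keras.utils.to_categorical(df['label'].values, num_classes)
    return X, y

for epoch in range(epochs):
    for path in paths_train:
        X_shard, y_shard = load_shard(path).compute()  # materialize one shard
        model.fit(X_shard, y_shard,
                  batch_size=batch_size,
                  epochs=1,                  # one pass over this shard
                  class_weight={0: 1, 1: 6.5},
                  verbose=0,
                  shuffle=False)

You could also use model.train_on_batch inside the inner loop for finer control; calling fit with epochs=1 per shard simply keeps the batch_size and class_weight handling from your original call.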
Eventually it would be nice to see the Keras and Dask projects learn more about each other to facilitate these workloads without excess work.