(1) My expected shape of the returned array S1 is (20, 10). Why is it (22, 10)? (2) How can I efficiently extract some rows from df0 and df1 and construct a new array?
The CSV files in this example are not large, but in practice they can be more than 8 GB, and the parameter M can be larger than 2000.
My code is as follows.
import dask.dataframe as dd
import numpy as np
from tensorflow.keras.utils import to_categorical

# Define the dataframes
file0 = './dataset_zeros.csv'
file1 = './dataset_ones.csv'
df0 = dd.read_csv(file0, dtype="str", header=None)
df1 = dd.read_csv(file1, dtype="str", header=None)
# Drop the index column (with header=None the columns are auto-named 0, 1, 2, ...)
df0 = df0.drop(0, axis=1)
df1 = df1.drop(0, axis=1)
def generate_S(file0, file1, init, M, N_in, N_out):
    a = int(M / N_out)  # if M=20, N_out=2, then a=10
    # Read the csv files
    df0 = dd.read_csv(file0, header=None)
    df1 = dd.read_csv(file1, header=None)
    # Drop the index column
    df0 = df0.drop(0, axis=1)
    df1 = df1.drop(0, axis=1)
    start = init * a
    end = (init + 1) * a
    # extract a=10 rows from df0 (Part 1)
    train_X0 = df0.loc[start:end, :]  # select rows
    train_X0 = train_X0.iloc[:, :10]  # select columns
    train_X0 = train_X0.values        # convert dataframe to array
    # extract a=10 rows from df1 (Part 2)
    train_X1 = df1.loc[start:end]
    train_X1 = train_X1.iloc[:, :10]
    train_X1 = train_X1.values
    # concatenate the two parts into a new array
    new_X = np.concatenate((train_X0, train_X1), axis=0)
    #================================
    #res = new_X.reshape(M, N_in)
    res = new_X
    return res
# Example parameters
init = 2
M = 20
N_in = 10
N_out = 2
# Call the function
S1 = generate_S(file0, file1, init, M, N_in, N_out)
The dataframes df0 and df1 look like this: [screenshot omitted]
Then I run
S1.compute_chunk_sizes()
"My expected shape of the return array S1 are (20,10). Why it is (22,10)?" This is because I did not understand and check the index start
and end
: In df.loc[]
, both the start
and end
are taken into account! For example, if I want to extract 10 rows, I should set start=20; end=29
, instead of start=20; end=30
.
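A minimal sketch that demonstrates the difference (using a toy 40-row pandas dataframe, not the real dataset; dask's .loc follows the same pandas convention):
import pandas as pd
df = pd.DataFrame({"x": range(40)})
print(len(df.loc[20:30]))   # 11 rows -- .loc is label-based and includes both endpoints
print(len(df.loc[20:29]))   # 10 rows
print(len(df.iloc[20:30]))  # 10 rows -- .iloc is position-based and excludes the stop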
The correct piece of code is:
start = init * a
end = (init + 1) * a - 1
# extract a=10 rows from df0 (Part 1)
train_X0 = df0.loc[start:end, :]  # select rows
Therefore, the function generate_S() is modified as follows.
def generate_S(file0, file1, init, M, N_in, N_out):
    a = int(M / N_out)
    # Read the csv files
    df0 = dd.read_csv(file0, header=None)
    df1 = dd.read_csv(file1, header=None)
    # Drop the index column
    df0 = df0.drop(0, axis=1)
    df1 = df1.drop(0, axis=1)
    start = init * a
    end = (init + 1) * a - 1  # .loc includes the endpoint, so stop one row earlier
    # extract a=10 rows from df0 (Part 1)
    train_X0 = df0.loc[start:end, :]  # select rows
    train_X0 = train_X0.iloc[:, :10]  # select columns
    train_X0 = train_X0.values        # convert dataframe to array
    # extract a=10 rows from df1 (Part 2)
    train_X1 = df1.loc[start:end]
    train_X1 = train_X1.iloc[:, :10]
    train_X1 = train_X1.values
    new_X = np.concatenate((train_X0, train_X1), axis=0)
    new_X.compute_chunk_sizes()  # make the previously unknown chunk sizes concrete
    # Test
    print("new_X.SHAPE:")
    print(new_X.shape)
    res = new_X.reshape(M, N_in)
    return res
The function will return an array with shape (M, 10) (in this code, M=20). Part 1 of the problem is solved.
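A quick check with the example parameters (assuming the same small CSV files as above):
S1 = generate_S(file0, file1, init, M, N_in, N_out)
print(S1.shape)  # (20, 10) for init=2, M=20, N_in=10, N_out=2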
Part 2 of the problem: new_X.compute_chunk_sizes() in the function generate_S() is very time-consuming when the csv files are large. Even worse, it gives a wrong result. For my large csv files, the shape of new_X is:
new_X.SHAPE:
(1170, 784)
But the expected one is (a, 784). Here, a=10. It seems that the function generate_S() operates on each chunk! (There are 117 chunks in this example.) I really want it to operate only once.
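A way to see where the chunks come from (a hypothetical check, assuming the same df0 as above): when a dask array is built from a CSV-backed dataframe, the number of rows per chunk is unknown until the whole file has been scanned, which is exactly what makes compute_chunk_sizes() so expensive.
arr = df0.values         # dask array backed by the dataframe partitions
print(df0.npartitions)   # e.g. 117 partitions for the large csv
print(arr.chunks)        # row chunk sizes show up as nan until computed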
I hope to find a correct and efficient method to implement this function.
=====
I have found the right method. dask is not necessary here. To generate the array from the large csv files, I can use the keywords skiprows and nrows in pandas.read_csv(). Here is my new version of the function. It reads lines from the two csv files and merges them into one array.
import numpy as np
import pandas as pd

def generate_S(file0, file1, init, M, N_in, N_out):
    a = int(M / N_out)
    # Read only the a rows we need from each csv file
    # (init is treated as 1-based here: block init covers rows (init-1)*a .. init*a - 1)
    df0 = pd.read_csv(file0, header=None, skiprows=(init - 1) * a, nrows=a)
    df1 = pd.read_csv(file1, header=None, skiprows=(init - 1) * a, nrows=a)
    # Drop the index column
    df0 = df0.drop(0, axis=1)
    df1 = df1.drop(0, axis=1)
    # Part 1: rows from file0
    train_X0 = df0.iloc[:, :-1]  # select all columns except the last
    train_X0 = train_X0.values   # convert dataframe to array
    # Part 2: rows from file1
    train_X1 = df1.iloc[:, :-1]
    train_X1 = train_X1.values
    new_X = np.concatenate((train_X0, train_X1), axis=0)
    return new_X
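A quick usage sketch with the example parameters from the question (assuming the same two CSV files; the column count of the result depends on how many columns the files have):
S1 = generate_S(file0, file1, init, M, N_in, N_out)
print(S1.shape)  # (M, ncols - 2): a rows from each file, minus the index and last columns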
The problem is solved.