Tags: python, dataframe, numpy, dask, data-analysis

How to return a numpy array with the expected shape from a large csv file?


(1) The expected shape of the returned array S1 is (20, 10). Why is it (22, 10)? (2) How can I extract some rows from df0 and df1 and construct a new array efficiently?

The csv files in this example are not large, but in practice they can be more than 8 GB, and the parameter M can be larger than 2000.

My code is as follows.

import dask.dataframe as dd
import numpy as np
from tensorflow.keras.utils import to_categorical

# Define the dataframes
file0 = './dataset_zeros.csv'
file1 = './dataset_ones.csv'
df0 = dd.read_csv(file0, dtype="str", header=None)
df1 = dd.read_csv(file1, dtype="str", header=None)
# Drop the index column
df0 = df0.drop(0, axis=1)
df1 = df1.drop(0, axis=1)

def generate_S(file0, file1, init, M, N_in, N_out):
    a = int(M/N_out)  # if M=20, N_out=2, then a=10
    # Read csv files
    df0 = dd.read_csv(file0, header=None)
    df1 = dd.read_csv(file1, header=None)
    # Drop the index column
    df0 = df0.drop(0, axis=1)
    df1 = df1.drop(0, axis=1)

    start = init*a
    end = (init+1)*a

    # extract a=10 rows from df0 (Part 1)
    train_X0 = df0.loc[start:end, :]  # select rows
    train_X0 = train_X0.iloc[:, :10]  # select columns
    train_X0 = train_X0.values        # convert dataframe to array

    # extract a=10 rows from df1 (Part 2)
    train_X1 = df1.loc[start:end]
    train_X1 = train_X1.iloc[:, :10]
    train_X1 = train_X1.values

    # concatenate the two parts into a new array
    new_X = np.concatenate((train_X0, train_X1), axis=0)

    #================================
    #res = new_X.reshape(M, N_in)
    res = new_X
    return res

# Examples of parameters
init = 2
M = 20
N_in = 10
N_out = 2

# Call the function
S1 = generate_S(file0, file1, init, M, N_in, N_out)

The dataframes df0 and df1 look like this: [screenshot of the dataframes]

Then I run

S1.compute_chunk_sizes()

The result shows that the shape of S1 is (22, 10), not the expected (20, 10).


Solution

  • "My expected shape of the return array S1 are (20,10). Why it is (22,10)?" This is because I did not understand and check the index start and end: In df.loc[], both the start and end are taken into account! For example, if I want to extract 10 rows, I should set start=20; end=29, instead of start=20; end=30.

    The correct piece of code is:

    start = init*a
    end = (init+1)*a - 1
    # extract a=10 rows from df0 (Part 1)
    train_X0 = df0.loc[start:end,:] # select rows
    
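    Unlike plain Python slicing, a label-based .loc slice in pandas (and dask) includes both endpoints, while the position-based .iloc uses a half-open range. A minimal sketch with a toy 40-row frame:

    import pandas as pd

    df = pd.DataFrame({"x": range(40)})  # toy frame with index labels 0..39
    print(len(df.loc[20:30]))   # 11 -- both endpoints 20 and 30 are included
    print(len(df.loc[20:29]))   # 10 -- the intended selection
    print(len(df.iloc[20:30]))  # 10 -- iloc uses a half-open range

    This is exactly why the original code returned 2 x 11 = 22 rows instead of 2 x 10 = 20.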

    Therefore, the function generate_S() is modified as follows.

    def generate_S(file0, file1, init, M, N_in, N_out):
        a = int(M/N_out)
        # Read csv files
        df0 = dd.read_csv(file0, header=None)
        df1 = dd.read_csv(file1, header=None)
        # Drop the index column
        df0 = df0.drop(0, axis=1)
        df1 = df1.drop(0, axis=1)

        start = init*a
        end = (init+1)*a - 1  # .loc includes the end label

        # extract a=10 rows from df0 (Part 1)
        train_X0 = df0.loc[start:end, :]  # select rows
        train_X0 = train_X0.iloc[:, :10]  # select columns
        train_X0 = train_X0.values        # convert dataframe to array

        # extract a=10 rows from df1 (Part 2)
        train_X1 = df1.loc[start:end]
        train_X1 = train_X1.iloc[:, :10]
        train_X1 = train_X1.values

        new_X = np.concatenate((train_X0, train_X1), axis=0)
        # reshape() needs known chunk sizes, so compute them first
        new_X.compute_chunk_sizes()

        # Test
        print("new_X.SHAPE:")
        print(new_X.shape)

        res = new_X.reshape(M, N_in)
        return res
    

    The function now returns an array with shape (M, 10) (in this code, M=20). Part 1 of the problem is solved.
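    A quick sanity check on the small example files (a sketch; the result is still a lazy dask array until compute() is called):

    S1 = generate_S(file0, file1, init, M, N_in, N_out)
    print(S1.shape)       # (20, 10) -- chunk sizes were computed inside generate_S
    S1_np = S1.compute()  # materialize the result as a numpy array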

    Part 2 of the problem is: new_X.compute_chunk_sizes() in the function generate_S() is very time-consuming when the csv files are large. Even worse, it gives a wrong result. For my large csv files, the shape of new_X is:

    new_X.SHAPE:
    (1170, 784)

    But the expected one is (a, 784); here a=10. It seems that generate_S() operates on each chunk! (There are 117 chunks in this example, and 117 x 10 = 1170.) I really want it to operate only once.
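    A likely cause, sketched below: dd.read_csv() gives every partition its own RangeIndex starting at 0 and leaves the divisions unknown, so a label-based .loc[start:end] can match rows in each partition rather than in the file as a whole. A small inspection sketch (assuming the example file exists):

    import dask.dataframe as dd

    ddf = dd.read_csv('./dataset_ones.csv', header=None)
    print(ddf.npartitions, ddf.known_divisions)  # e.g. 117, False
    # every partition's index restarts at 0, so the same labels repeat
    for i in range(min(ddf.npartitions, 3)):
        part = ddf.get_partition(i).compute()
        print(i, part.index[:3].tolist())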

    I hope to find a correct and efficient method to implement this function.

    =====

    I have found the right method. dask is not necessary here. To generate the array from a large csv file, I can use the keywords skiprows and nrows in pandas.read_csv(). Here is my new version of the function. It reads rows from the two csv files and merges them into one array.

    import pandas as pd
    import numpy as np

    def generate_S(file0, file1, init, M, N_in, N_out):
        a = int(M/N_out)
        # Read only the a rows we need from each csv file
        # (note: with skiprows=(init-1)*a, init is effectively 1-based here)
        df0 = pd.read_csv(file0, header=None, skiprows=(init-1)*a, nrows=a)
        df1 = pd.read_csv(file1, header=None, skiprows=(init-1)*a, nrows=a)
        # Drop the index column
        df0 = df0.drop(0, axis=1)
        df1 = df1.drop(0, axis=1)
        # 0
        train_X0 = df0.iloc[:, :-1]  # select columns (all but the last)
        train_X0 = train_X0.values   # convert dataframe to array
        # 1
        train_X1 = df1.iloc[:, :-1]
        train_X1 = train_X1.values

        new_X = np.concatenate((train_X0, train_X1), axis=0)
        return new_X
    
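    A usage sketch with the example parameters from above (the column count of the result depends on the width of the csv files):

    S1 = generate_S(file0, file1, init, M, N_in, N_out)
    print(S1.shape)  # M rows in total; e.g. (20, 784) for the large files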

    The problem is solved.