Tags: python, dataframe, numpy, dask, data-analysis

How to return a numpy array with the expected shape from a large csv file?


(1) The expected shape of the returned array S1 is (20, 10). Why is it (22, 10)? (2) How can I extract some rows from df0 and df1 and construct a new array efficiently?

The csv files in this example are not large, but in practice they can be more than 8 GB, and the parameter M can be larger than 2000.

My code is as follows.

import dask.dataframe as dd
import numpy as np
from tensorflow.keras.utils import to_categorical

# Define the dataframes
file0 = './dataset_zeros.csv'
file1 = './dataset_ones.csv'
df0 = dd.read_csv(file0, dtype="str", header=None)
df1 = dd.read_csv(file1, dtype="str", header=None)
# Drop the index column
df0 = df0.drop(0, axis=1)
df1 = df1.drop(0, axis=1)

def generate_S(file0, file1, init, M, N_in, N_out):
    a = int(M/N_out)  # if M=20, N_out=2, then a=10
    # Read csv files
    df0 = dd.read_csv(file0, header=None)
    df1 = dd.read_csv(file1, header=None)
    # Drop the index column
    df0 = df0.drop(0, axis=1)
    df1 = df1.drop(0, axis=1)

    start = init*a
    end = (init+1)*a

    # extract a=10 rows from df0 (Part 1)
    train_X0 = df0.loc[start:end, :]  # select rows
    train_X0 = train_X0.iloc[:, :10]  # select columns
    train_X0 = train_X0.values        # convert dataframe to array

    # extract a=10 rows from df1 (Part 2)
    train_X1 = df1.loc[start:end]
    train_X1 = train_X1.iloc[:, :10]
    train_X1 = train_X1.values

    # concatenate the two parts into a new array
    new_X = np.concatenate((train_X0, train_X1), axis=0)

    #================================
    #res = new_X.reshape(M, N_in)
    res = new_X
    return res

# Examples of parameters
init = 2
M = 20
N_in = 10
N_out = 2

# Call the function
S1 = generate_S(file0, file1, init, M, N_in, N_out)

The dataframes df0 and df1 look like this: [screenshot of the dataframes]

Then I run

S1.compute_chunk_sizes()

The result shows that the shape of S1 is (22, 10), not the expected (20, 10).


Solution

  • "My expected shape of the return array S1 are (20,10). Why it is (22,10)?" This is because I did not understand and check the index start and end: In df.loc[], both the start and end are taken into account! For example, if I want to extract 10 rows, I should set start=20; end=29, instead of start=20; end=30.

    The correct piece of code is:

    start = init*a
    end = (init+1)*a - 1
    # extract a=10 rows from df0 (Part 1)
    train_X0 = df0.loc[start:end,:] # select rows
    
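    Unlike plain Python slicing, a label-based .loc slice in pandas (and dask) includes both endpoints, while the position-based .iloc uses a half-open range. A minimal sketch with a toy 40-row frame:

    import pandas as pd

    df = pd.DataFrame({"x": range(40)})  # toy frame with index labels 0..39
    print(len(df.loc[20:30]))   # 11 -- both endpoints 20 and 30 are included
    print(len(df.loc[20:29]))   # 10 -- the intended selection
    print(len(df.iloc[20:30]))  # 10 -- iloc uses a half-open range

    This is exactly why the original code returned 2 x 11 = 22 rows instead of 2 x 10 = 20.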

    Therefore, the function generate_S() is modified as follows.

    def generate_S(file0, file1, init, M, N_in, N_out):
        a = int(M/N_out)
        # Read csv files
        df0 = dd.read_csv(file0, header=None)
        df1 = dd.read_csv(file1, header=None)
        # Drop the index column
        df0 = df0.drop(0, axis=1)
        df1 = df1.drop(0, axis=1)

        start = init*a
        end = (init+1)*a - 1  # .loc includes the end label

        # extract a=10 rows from df0 (Part 1)
        train_X0 = df0.loc[start:end, :]  # select rows
        train_X0 = train_X0.iloc[:, :10]  # select columns
        train_X0 = train_X0.values        # convert dataframe to array

        # extract a=10 rows from df1 (Part 2)
        train_X1 = df1.loc[start:end]
        train_X1 = train_X1.iloc[:, :10]
        train_X1 = train_X1.values

        new_X = np.concatenate((train_X0, train_X1), axis=0)
        # reshape() needs known chunk sizes, so compute them first
        new_X.compute_chunk_sizes()

        # Test
        print("new_X.SHAPE:")
        print(new_X.shape)

        res = new_X.reshape(M, N_in)
        return res
    

    The function now returns an array with shape (M, 10) (in this code, M=20). Part 1 of the problem is solved.
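    A quick sanity check on the small example files (a sketch; the result is still a lazy dask array until compute() is called):

    S1 = generate_S(file0, file1, init, M, N_in, N_out)
    print(S1.shape)       # (20, 10) -- chunk sizes were computed inside generate_S
    S1_np = S1.compute()  # materialize the result as a numpy array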

    Part 2 of the problem is: new_X.compute_chunk_sizes() in the function generate_S() is very time-consuming when the csv files are large. Even worse, it gives a wrong result. For my large csv files, the shape of new_X is:

    new_X.SHAPE:
    (1170, 784)

    But the expected one is (a, 784); here a=10. It seems that generate_S() operates on each chunk! (There are 117 chunks in this example, and 117 x 10 = 1170.) I really want it to operate only once.
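    A likely cause, sketched below: dd.read_csv() gives every partition its own RangeIndex starting at 0 and leaves the divisions unknown, so a label-based .loc[start:end] can match rows in each partition rather than in the file as a whole. A small inspection sketch (assuming the example file exists):

    import dask.dataframe as dd

    ddf = dd.read_csv('./dataset_ones.csv', header=None)
    print(ddf.npartitions, ddf.known_divisions)  # e.g. 117, False
    # every partition's index restarts at 0, so the same labels repeat
    for i in range(min(ddf.npartitions, 3)):
        part = ddf.get_partition(i).compute()
        print(i, part.index[:3].tolist())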

    I hope to find a correct and efficient method to implement this function.

    =====

    I have found the right method. dask is not necessary here. To generate the array from a large csv file, I can use the keywords skiprows and nrows in pandas.read_csv(). Here is my new version of the function. It reads rows from the two csv files and merges them into one array.

    import pandas as pd
    import numpy as np

    def generate_S(file0, file1, init, M, N_in, N_out):
        a = int(M/N_out)
        # Read only the a rows we need from each csv file
        # (note: with skiprows=(init-1)*a, init is effectively 1-based here)
        df0 = pd.read_csv(file0, header=None, skiprows=(init-1)*a, nrows=a)
        df1 = pd.read_csv(file1, header=None, skiprows=(init-1)*a, nrows=a)
        # Drop the index column
        df0 = df0.drop(0, axis=1)
        df1 = df1.drop(0, axis=1)
        # 0
        train_X0 = df0.iloc[:, :-1]  # select columns (all but the last)
        train_X0 = train_X0.values   # convert dataframe to array
        # 1
        train_X1 = df1.iloc[:, :-1]
        train_X1 = train_X1.values

        new_X = np.concatenate((train_X0, train_X1), axis=0)
        return new_X
    
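    A usage sketch with the example parameters from above (the column count of the result depends on the width of the csv files):

    S1 = generate_S(file0, file1, init, M, N_in, N_out)
    print(S1.shape)  # M rows in total; e.g. (20, 784) for the large files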

    The problem is solved.