python pandas dataframe numpy numpy-ndarray

How do I modify this function to return a 4d array instead of 3d?


I created this function, which takes a dataframe and returns ndarrays of inputs and labels.

import time

import numpy as np
import pandas as pd

def transform_to_array(dataframe, chunk_size=100):
    
    grouped = dataframe.groupby('id')

    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,]) # original inpt shape: [0, 1, chunk_size, 4]

    # loop over each group (df[df.id==1] and df[df.id==2])
    for _, group in grouped:

        inputs = group.loc[:, 'A':'D'].values 
        label = group.loc[:, 'label'].values[0]

        # calculate number of splits
        N = (len(inputs)-1) // chunk_size

        if N > 0:
            inputs = np.array_split(
                 inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]

        # loop over splits
        for inpt in inputs:
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)),(0, 0)], 
                mode='constant')
            # add each inputs split to accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0) 

    return X, y

The function returns X of shape (n_samples, 1, chunk_size, 4) and y of shape (n_samples,).

For example:

N = 10_000
id = np.arange(N)
labels = np.random.randint(5, size=N)
df = pd.DataFrame(data = np.random.randn(N, 4),  columns=list('ABCD'))

df['label'] = labels
df.insert(0, 'id', id)
df = df.loc[df.id.repeat(157)]

df.head()
    id      A            B          C            D    label
0   0   -0.571676   -0.337737   -0.019276   -1.377253   1
0   0   -0.571676   -0.337737   -0.019276   -1.377253   1
0   0   -0.571676   -0.337737   -0.019276   -1.377253   1
0   0   -0.571676   -0.337737   -0.019276   -1.377253   1
0   0   -0.571676   -0.337737   -0.019276   -1.377253   1

Calling the function on this dataframe generates the following:

X, y = transform_to_array(df)

X.shape   # shape of input
(20000, 1, 100, 4)
y.shape   # shape of label
(20000,)
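
To make the chunking arithmetic concrete, here is a minimal sketch for a single group of 157 rows, mirroring the split and padding steps of transform_to_array(). The names n_splits, split_points, chunks and padded are illustrative only and do not appear in the original function. It shows why each id contributes two samples, giving 20000 samples for 10000 ids.

```python
import numpy as np

chunk_size = 100
# one group: 157 rows, 4 feature columns (stand-in for columns A..D)
inputs = np.arange(157 * 4, dtype=float).reshape(157, 4)

# same arithmetic as in transform_to_array()
n_splits = (len(inputs) - 1) // chunk_size  # (157 - 1) // 100 == 1
split_points = [chunk_size + chunk_size * i for i in range(n_splits)]  # [100]
chunks = np.array_split(inputs, split_points) if n_splits > 0 else [inputs]

# zero-pad every chunk along the row axis up to chunk_size rows
padded = [
    np.pad(c, [(0, chunk_size - len(c)), (0, 0)], mode='constant')
    for c in chunks
]

print([c.shape for c in chunks])  # [(100, 4), (57, 4)]
print([p.shape for p in padded])  # [(100, 4), (100, 4)]
```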

This function works as intended; however, it takes a long time to execute:

start_time = time.time()
X, y = transform_to_array(df)
end_time = time.time()
print(f'Time taken: {end_time - start_time} seconds.')
Time taken: 227.83956217765808 seconds.

In an attempt to improve the function's performance (minimise execution time), I wrote the following modified version:

def modified_transform_to_array(dataframe, chunk_size=100):
    # group data by 'id'
    grouped = dataframe.groupby('id')
    # initialize lists to store transformed data
    X, y = [], []

    # loop over each group (df[df.id==1] and df[df.id==2])
    for _, group in grouped:
        # get input and label data for group
        inputs = group.loc[:, 'A':'D'].values 
        label = group.loc[:, 'label'].values[0]

        # calculate number of splits
        N = (len(inputs)-1) // chunk_size

        if N > 0:
            # split input data into chunks
            inputs = np.array_split(
                inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]

        # loop over splits
        for inpt in inputs:
            # pad input data to have a chunk size of chunk_size
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)),(0, 0)],
                mode='constant')
            # add each input split and corresponding label to lists
            X.append(inpt)
            y.append(label)

    # convert lists to numpy arrays
    X = np.array(X)
    y = np.array(y)

    return X, y

At first, it seemed that I had succeeded in reducing the time taken:

start_time = time.time()
X2, y2 = modified_transform_to_array(df)
end_time = time.time()
print(f'Time taken: {end_time - start_time} seconds.')
Time taken: 5.842168092727661 seconds.
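
A likely reason for the speed-up is that np.concatenate allocates a new array and copies the whole accumulator on every iteration, whereas appending to a Python list defers the copy to a single np.array call at the end. The sketch below isolates just that difference with hypothetical sizes (n_chunks is made up for illustration); actual timings will vary by machine.

```python
import time

import numpy as np

n_chunks, chunk_size = 1000, 100  # illustrative sizes, not from the question
chunk = np.zeros((chunk_size, 4))

# strategy 1: grow an array with np.concatenate (re-copies X_slow every time)
start = time.perf_counter()
X_slow = np.zeros([0, 1, chunk_size, 4])
for _ in range(n_chunks):
    X_slow = np.concatenate([X_slow, chunk[np.newaxis, np.newaxis]], axis=0)
t_concat = time.perf_counter() - start

# strategy 2: collect chunks in a list, convert once at the end
start = time.perf_counter()
parts = []
for _ in range(n_chunks):
    parts.append(chunk)
X_fast = np.array(parts)
t_list = time.perf_counter() - start

print(X_slow.shape, X_fast.shape)
print(f'concatenate: {t_concat:.4f} s, list + np.array: {t_list:.4f} s')
```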

However, the modified function changes the shape of the returned array:

X2.shape  # this should be (20000, 1, 100, 4)
(20000, 100, 4)

y2.shape  # this is fine
(20000,)
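
The missing axis comes from how the samples are assembled. The original function adds two axes to each chunk with inpt[np.newaxis, np.newaxis] and joins along the existing axis 0, while np.array on a list of (chunk_size, 4) arrays only stacks them along one new leading axis. A minimal sketch of the two behaviours, using a zero-filled placeholder chunk:

```python
import numpy as np

chunk = np.zeros((100, 4))  # placeholder for one padded chunk

# original approach: two new axes per chunk, joined along the existing axis 0
acc = np.zeros([0, 1, 100, 4])
a = np.concatenate([acc, chunk[np.newaxis, np.newaxis]], axis=0)
print(a.shape)  # (1, 1, 100, 4)

# modified approach: np.array stacks the list along a single new leading axis
b = np.array([chunk, chunk])
print(b.shape)  # (2, 100, 4)
```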

Question

How do I modify modified_transform_to_array() to return the intended array shape (n_samples, 1, chunk_size, 4), given that it is much faster?


Solution

  • You can simply reshape X just before returning it at the end of modified_transform_to_array(), e.g.:

    def modified_transform_to_array( ... ):
    
        ...
    
        # convert lists to numpy arrays
        X = np.array(X)
        y = np.array(y)
        X = X.reshape((X.shape[0], 1, *X.shape[1:]))  # <-- THIS LINE
        return X, y
    

    or, equivalently:

    X = X.reshape((X.shape[0], 1, X.shape[1], X.shape[2]))
    

    As pointed out in @MSS's answer, you can also achieve the same result with slicing: start from a slice that selects the whole array (i.e. X[:, :, :]) and insert None (or its more explicit alias np.newaxis) at the position where you want the new dimension:

    X = X[:, None, :, :]
    X = X[:, np.newaxis, :, :]
    

    The last two slices can be replaced by an Ellipsis (...), which expands to as many full-axis slices (i.e. : or slice(None)) as are needed to cover the remaining dimensions:

    X = X[:, None, ...]
    X = X[:, np.newaxis, ...]
    

    You may want to read the relevant section of NumPy's user guide for further explanations on the use of None and Ellipsis in NumPy's slicing.
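
As a quick sanity check, the sketch below applies the variants above to a small stand-in array and confirms that they all produce the same (n_samples, 1, chunk_size, 4) result. Two entries are additions not mentioned in the answer: np.expand_dims, which is another NumPy way to insert an axis, and adding the axis to each chunk before stacking (roughly what X.append(inpt[np.newaxis]) would do inside the loop). The array sizes and the candidates dictionary are illustrative only.

```python
import numpy as np

# small stand-in for the stacked array of shape (n_samples, chunk_size, 4)
X = np.arange(3 * 100 * 4, dtype=float).reshape(3, 100, 4)

candidates = {
    'reshape': X.reshape((X.shape[0], 1, *X.shape[1:])),
    'None index': X[:, None, :, :],
    'newaxis + Ellipsis': X[:, np.newaxis, ...],
    'expand_dims': np.expand_dims(X, axis=1),            # addition, not in the answer
    'per-chunk axis': np.array([c[np.newaxis] for c in X]),  # addition, not in the answer
}

reference = candidates['reshape']
for name, arr in candidates.items():
    # every variant should match the reshape result in shape and content
    print(name, arr.shape, np.array_equal(arr, reference))
```

Indexing with None or np.newaxis returns a view of X rather than a copy, so the extra axis costs essentially nothing compared with the loop itself.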