I created this function, which takes in a dataframe and returns ndarrays of inputs and labels:
import time

import numpy as np
import pandas as pd

def transform_to_array(dataframe, chunk_size=100):
    grouped = dataframe.groupby('id')
    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,]) # original inpt shape: [0, 1, chunk_size, 4]
    # loop over each group (df[df.id==1] and df[df.id==2])
    for _, group in grouped:
        inputs = group.loc[:, 'A':'D'].values
        label = group.loc[:, 'label'].values[0]
        # calculate number of splits
        N = (len(inputs)-1) // chunk_size
        if N > 0:
            inputs = np.array_split(
                inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]
        # loop over splits
        for inpt in inputs:
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)),(0, 0)],
                mode='constant')
            # add each inputs split to accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0)
    return X, y
The function returns X of shape (n_samples, 1, chunk_size, 4) and y of shape (n_samples,).
For example:
N = 10_000
id = np.arange(N)
labels = np.random.randint(5, size=N)
df = pd.DataFrame(data = np.random.randn(N, 4), columns=list('ABCD'))
df['label'] = labels
df.insert(0, 'id', id)
df = df.loc[df.id.repeat(157)]
df.head()
   id         A         B         C         D  label
0   0 -0.571676 -0.337737 -0.019276 -1.377253      1
0   0 -0.571676 -0.337737 -0.019276 -1.377253      1
0   0 -0.571676 -0.337737 -0.019276 -1.377253      1
0   0 -0.571676 -0.337737 -0.019276 -1.377253      1
0   0 -0.571676 -0.337737 -0.019276 -1.377253      1
Calling the function on this dataframe produces the following:
X, y = transform_to_array(df)
X.shape # shape of input
(20000, 1, 100, 4)
y.shape # shape of label
(20000,)
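The 20,000 samples follow from the chunking: each of the 10,000 ids has 157 rows, which split into one chunk of 100 rows and one chunk of 57 rows that is zero-padded to 100. Here is a minimal sketch of that for a single group, using the same split and pad calls as the function (the array `group_values` is a made-up stand-in for one group's A to D columns):

```python
import numpy as np

chunk_size = 100
# hypothetical stand-in for one group's 'A':'D' values (157 rows, 4 columns)
group_values = np.ones((157, 4))

# same split logic as in transform_to_array()
n_splits = (len(group_values) - 1) // chunk_size
if n_splits > 0:
    split_points = [chunk_size * (i + 1) for i in range(n_splits)]
    chunks = np.array_split(group_values, split_points)
else:
    chunks = [group_values]

# zero-pad every chunk along the row axis up to chunk_size
padded = [
    np.pad(chunk, [(0, chunk_size - len(chunk)), (0, 0)], mode='constant')
    for chunk in chunks
]

print([len(chunk) for chunk in chunks])   # rows per chunk before padding
print([chunk.shape for chunk in padded])  # shapes after padding
```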
The function works as intended, but it takes a long time to run:
start_time = time.time()
X, y = transform_to_array(df)
end_time = time.time()
print(f'Time taken: {end_time - start_time} seconds.')
Time taken: 227.83956217765808 seconds.
In an attempt to reduce the execution time, I wrote the following modified function:
def modified_transform_to_array(dataframe, chunk_size=100):
    # group data by 'id'
    grouped = dataframe.groupby('id')
    # initialize lists to store transformed data
    X, y = [], []
    # loop over each group (df[df.id==1] and df[df.id==2])
    for _, group in grouped:
        # get input and label data for group
        inputs = group.loc[:, 'A':'D'].values
        label = group.loc[:, 'label'].values[0]
        # calculate number of splits
        N = (len(inputs)-1) // chunk_size
        if N > 0:
            # split input data into chunks
            inputs = np.array_split(
                inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]
        # loop over splits
        for inpt in inputs:
            # pad input data to have a chunk size of chunk_size
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)),(0, 0)],
                mode='constant')
            # add each input split and corresponding label to lists
            X.append(inpt)
            y.append(label)
    # convert lists to numpy arrays
    X = np.array(X)
    y = np.array(y)
    return X, y
At first, it seemed that I had succeeded in reducing the time taken:
start_time = time.time()
X2, y2 = modified_transform_to_array(df)
end_time = time.time()
print(f'Time taken: {end_time - start_time} seconds.')
Time taken: 5.842168092727661 seconds.
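Presumably the difference comes from how the results are accumulated: np.concatenate allocates a new array and copies everything collected so far on every call, whereas list.append only stores a reference and the data is copied once in the final np.array call. A small sketch with made-up chunks (all names below are only for illustration) showing that both strategies build the same array:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical padded chunks, each of shape (chunk_size, 4)
chunks = [rng.standard_normal((100, 4)) for _ in range(50)]

# strategy 1: grow an array with np.concatenate (copies all data each time)
X_concat = np.zeros([0, 100, 4])
for chunk in chunks:
    X_concat = np.concatenate([X_concat, chunk[np.newaxis]], axis=0)

# strategy 2: collect in a list and convert once at the end
collected = []
for chunk in chunks:
    collected.append(chunk)
X_list = np.array(collected)

print(X_concat.shape, X_list.shape)
print(np.array_equal(X_concat, X_list))
```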
However, the modified function changes the shape of the returned X:
X2.shape # this should be (20000, 1, 100, 4)
(20000, 100, 4)
y2.shape # this is fine
(20000,)
Question
How do I modify modified_transform_to_array(), which is much faster, so that it returns the intended array shape (n_samples, 1, chunk_size, 4)?
You can simply reshape X just before returning it at the end of modified_transform_to_array(), e.g.:
def modified_transform_to_array( ... ):
    ...
    # convert lists to numpy arrays
    X = np.array(X)
    y = np.array(y)
    X = X.reshape((X.shape[0], 1, *X.shape[1:])) # <-- THIS LINE
    return X, y
or, equivalently:
X = X.reshape((X.shape[0], 1, X.shape[1], X.shape[2]))
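As a quick sanity check, here is a sketch on a small dummy array (its shape is made up for illustration) showing that both reshape calls give the same 4-D result and leave the data untouched:

```python
import numpy as np

# dummy stand-in for the (n_samples, chunk_size, 4) array
X = np.arange(3 * 5 * 4, dtype=float).reshape(3, 5, 4)

X_a = X.reshape((X.shape[0], 1, *X.shape[1:]))
X_b = X.reshape((X.shape[0], 1, X.shape[1], X.shape[2]))

print(X_a.shape)  # (3, 1, 5, 4)
print(np.array_equal(X_a, X_b))      # both forms agree
print(np.array_equal(X_a[:, 0], X))  # data is unchanged
```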
As pointed out in @MSS's answer, you can also get the same result with indexing: start from a slicing that selects the whole array (i.e. X[:, :, :]) and insert None (or its more explicit alias np.newaxis) at the position where you want the new dimension:
X = X[:, None, :, :]
X = X[:, np.newaxis, :, :]
The last two slices can be replaced by an Ellipsis (...), which expands to as many full-axis slices (i.e. : or slice(None)) as are needed to cover the remaining dimensions of the array:
X = X[:, None, ...]
X = X[:, np.newaxis, ...]
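A small sketch, on a dummy array, to check that all of these indexing variants agree with the reshape. It also includes np.expand_dims, which is not mentioned above but is another standard NumPy way to insert an axis:

```python
import numpy as np

X = np.zeros((3, 5, 4))  # dummy (n_samples, chunk_size, 4) array

variants = {
    'None': X[:, None, :, :],
    'np.newaxis': X[:, np.newaxis, :, :],
    'None + Ellipsis': X[:, None, ...],
    'np.newaxis + Ellipsis': X[:, np.newaxis, ...],
    'reshape': X.reshape((X.shape[0], 1, *X.shape[1:])),
    'np.expand_dims': np.expand_dims(X, axis=1),  # not from the text above
}

for name, arr in variants.items():
    # print the resulting shape and whether the result shares memory with X
    print(name, arr.shape, np.shares_memory(arr, X))
```

For a contiguous array like this one, none of the variants copies the data; they all return views of X.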
You may want to read the relevant section of NumPy's user guide for further explanation of the use of None and Ellipsis in NumPy's slicing.
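Putting it together, here is a sketch of the complete function with the extra axis added at the end, checked on a small made-up dataframe (3 ids with 157 rows each). The name `transform_to_array_fast` and the test data are only illustrative and not taken from the question:

```python
import numpy as np
import pandas as pd


def transform_to_array_fast(dataframe, chunk_size=100):
    """Illustrative variant of modified_transform_to_array() with a 4-D X."""
    X, y = [], []
    for _, group in dataframe.groupby('id'):
        inputs = group.loc[:, 'A':'D'].values
        label = group.loc[:, 'label'].values[0]
        # number of split points needed for this group
        n_splits = (len(inputs) - 1) // chunk_size
        if n_splits > 0:
            chunks = np.array_split(
                inputs, [chunk_size * (i + 1) for i in range(n_splits)])
        else:
            chunks = [inputs]
        for chunk in chunks:
            # zero-pad the rows of each chunk up to chunk_size
            chunk = np.pad(
                chunk, [(0, chunk_size - len(chunk)), (0, 0)],
                mode='constant')
            X.append(chunk)
            y.append(label)
    # convert once, then insert the singleton axis at position 1
    X = np.array(X)[:, np.newaxis, ...]
    y = np.array(y)
    return X, y


# small made-up example: 3 ids, 157 rows each
n_ids = 3
df = pd.DataFrame(np.random.randn(n_ids, 4), columns=list('ABCD'))
df['label'] = np.random.randint(5, size=n_ids)
df.insert(0, 'id', np.arange(n_ids))
df = df.loc[df.id.repeat(157)]

X, y = transform_to_array_fast(df)
print(X.shape)  # (6, 1, 100, 4)
print(y.shape)  # (6,)
```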