I am trying to partition a 2D numpy array into two separate numpy arrays based on the contents of a particular column. This is my code:
import numpy as np
import pandas as pd

@profile
def partition_data(arr, target_colm):
    total_colms = arr.shape[1]
    target_data = arr[:, target_colm]
    type1_data = []
    type2_data = []
    for i in range(arr.shape[0]):
        if target_data[i] == 0:  # if value == 0, put the row in the first array
            type1_data = np.append(type1_data, arr[i])
        else:
            type2_data = np.append(type2_data, arr[i])
    type1_data = np.array(type1_data).reshape(int(len(type1_data)/total_colms), total_colms)
    type2_data = np.array(type2_data).reshape(int(len(type2_data)/total_colms), total_colms)
    return type1_data, type2_data

d = pd.read_csv('data.csv').values
x, y = partition_data(d, 7)  # partition on the values of column 7
Note: For my experiment, I used an array of shape (14359, 42).
Now, when I profile this function with the kernprof line profiler, I get the following results.
Wrote profile results to code.py.lprof
Timer unit: 1e-06 s
Total time: 7.3484 s
File: code2.py
Function: part_data at line 8
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 @profile
9 def part_data(arr,target_col):
10 1 7.0 7.0 0.0 total_colms = arr.shape[1]
11 1 14.0 14.0 0.0 target_data = arr[:,target_col]
12 1 2.0 2.0 0.0 type1_data = []
13 1 1.0 1.0 0.0 type2_data = []
14 5161 40173.0 7.8 0.5 for i in range(arr.shape[0]):
15 5160 39225.0 7.6 0.5 if target_data[i]==6:
16 4882 7231260.0 1481.2 98.4 type1_data = np.append(type1_data,arr[i])
17 else:
18 278 33915.0 122.0 0.5 type2_data = np.append(type2_data,arr[i])
19 1 3610.0 3610.0 0.0 type1_data = np.array(type1_data).reshape(int(len(type1_data)/total_colms),total_colms)
20 1 187.0 187.0 0.0 type2_data = np.array(type2_data).reshape(int(len(type2_data)/total_colms),total_colms)
21 1 3.0 3.0 0.0 return type1_data, type2_data
Here, line 16 alone takes up almost all of the time (98.4%). In the future, the real data I will work with will be much bigger.
Can anyone please suggest a faster method of partitioning a numpy array?
This should make it a lot faster:
def partition_data_vectorized(arr, target_colm):
    total_colms = arr.shape[1]
    target_data = arr[:, target_colm]
    mask = target_data == 0
    type1_data = arr[mask, :]
    type2_data = arr[~mask, :]
    return (
        type1_data.reshape(int(type1_data.size / total_colms), total_colms),
        type2_data.reshape(int(type2_data.size / total_colms), total_colms))
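As a side note, boolean-mask indexing already returns a 2D array with the original column count, so the final reshape calls are effectively no-ops kept for symmetry with the original function. A small sketch (the toy array here is mine) to confirm this:

```python
import numpy as np

# 4x3 toy array; pretend column 1 is the target column.
arr = np.arange(12).reshape(4, 3).astype(float)
arr[:, 1] = [0, 5, 0, 9]

mask = arr[:, 1] == 0
print(arr[mask, :].shape)   # rows where column 1 == 0 -> (2, 3)
print(arr[~mask, :].shape)  # remaining rows           -> (2, 3)
```

Both results are already 2D with 3 columns, so no reshape is needed.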
Some timings:
# Generate some sample inputs:
arr = np.random.rand(10000, 42)
arr[:, 7] = np.random.randint(0, 10, 10000)
%timeit c, d = partition_data_vectorized(arr, 7)
# 2.09 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit a, b = partition_data(arr, 7)
# 4.07 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This is roughly 2000 times faster than the non-vectorized calculation!
Comparing the results:
np.all(b == d)
# Out: True
np.all(a == c)
# Out: True
So the results are identical, and the vectorized version is about 2000 times faster, simply by replacing the for-loop and the repeated array creation via np.append with vectorized boolean-mask indexing.
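The reason the original loop is so slow is that np.append allocates a brand-new array and copies all existing elements on every call, so building an n-row result costs O(n²) element copies. If you ever do need to accumulate rows incrementally, appending to a plain Python list and converting once at the end keeps the work linear. A minimal sketch (the function name is mine, not from the original post):

```python
import numpy as np

def partition_data_listappend(arr, target_colm):
    # Python list append is amortized O(1), unlike np.append,
    # which copies the whole accumulated array on every call.
    type1_rows, type2_rows = [], []
    for row in arr:
        (type1_rows if row[target_colm] == 0 else type2_rows).append(row)
    total_colms = arr.shape[1]
    # np.array stacks the row list into a 2D array;
    # reshape(-1, total_colms) also handles an empty list cleanly.
    return (np.array(type1_rows).reshape(-1, total_colms),
            np.array(type2_rows).reshape(-1, total_colms))
```

This is still slower than the boolean-mask version, which does the whole partition in a handful of C-level passes, but it avoids the quadratic blow-up of repeated np.append calls.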