I have a 2D NumPy array containing integers between 0 and 100. For a particular column, I want to map the values in the following way:
0-4 mapped to 0
5-9 mapped to 5
10-14 mapped to 10, and so on.
This is my code:
import numpy as np
def map_column(arr,col,incr):
    col_data = arr[:,col]
    vec = np.arange(0,100,incr)
    for i in range(col_data.shape[0]):
        for j in range(len(vec)-1):
            if (col_data[i]>=vec[j] and col_data[i]<vec[j+1]):
                col_data[i] = vec[j]
        if (col_data[i]>vec[-1]):
            col_data[i] = vec[-1]
    return col_data
myarr = np.random.randint(100,size=(80000,4))
x = map_column(myarr,2,5)
This code takes 8.3 seconds to run. The following is the output of running line_profiler on this code.
Timer unit: 1e-06 s
Total time: 8.32155 s
File: testcode2.py
Function: map_column at line 2
Line #      Hits         Time  Per Hit   % Time  Line Contents
     2                                           @profile
     3                                           def map_column(arr,col,incr):
     4         1         17.0     17.0      0.0      col_data = arr[:,col]
     5         1         34.0     34.0      0.0      vec = np.arange(0,100,incr)
     6     80001     139232.0      1.7      1.7      for i in range(col_data.shape[0]):
     7   1600000    2778636.0      1.7     33.4          for j in range(len(vec)-1):
     8   1520000    4965687.0      3.3     59.7              if (col_data[i]>=vec[j] and col_data[i]<vec[j+1]):
     9     76062     207492.0      2.7      2.5                  col_data[i] = vec[j]
    10     80000     221693.0      2.8      2.7          if (col_data[i]>vec[-1]):
    11      3156       8761.0      2.8      0.1              col_data[i] = vec[-1]
    12         1          2.0      2.0      0.0      return col_data
In the future I will have to work with real data that is much larger than this. Can anyone suggest a faster way to do this?
If I understand the question correctly, this can be solved with a single arithmetic expression:
def map_column(arr,col,incr):
    col_data = arr[:,col]
    return (col_data//incr)*incr
Integer division discards the remainder, so multiplying the quotient by the increment again gives the largest multiple of the increment that does not exceed the original value. Because the expression operates on the whole column at once, no Python-level loop is needed.
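To check that the arithmetic version matches the original loop, here is a minimal sketch that runs both on a small random array. The names map_column_loop and map_column_fast are made up for this comparison, and the loop version works on a copy so that the input array is left untouched.

```python
import numpy as np

def map_column_loop(arr, col, incr):
    """Reference: the double loop from the question, applied to a copy."""
    col_data = arr[:, col].copy()
    vec = np.arange(0, 100, incr)
    for i in range(col_data.shape[0]):
        for j in range(len(vec) - 1):
            if vec[j] <= col_data[i] < vec[j + 1]:
                col_data[i] = vec[j]
        if col_data[i] > vec[-1]:
            col_data[i] = vec[-1]
    return col_data

def map_column_fast(arr, col, incr):
    """Vectorized: floor-divide by the increment, then multiply back."""
    return (arr[:, col] // incr) * incr

rng = np.random.default_rng(0)
small = rng.integers(0, 100, size=(1000, 4))  # values 0..99, as in the question

slow = map_column_loop(small, 2, 5)
fast = map_column_fast(small, 2, 5)
agree = np.array_equal(slow, fast)
print(agree)
```

Two differences from the question's code are worth keeping in mind. First, the loop caps everything above the last bin, so a value of exactly 100 becomes 95 there but stays 100 with floor division; wrapping the result in np.minimum(..., 95) would reproduce the cap if it is needed. Second, the question's function overwrites the column of myarr in place, because arr[:,col] is a view, whereas the arithmetic version returns a new array; assign it back with arr[:, col] = ... if in-place behaviour is wanted.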