Tags: python, numpy, mapping, line-profiler

Optimized method for mapping contents of a column in a 2D numpy array


I have a 2D numpy array containing integers between 0 and 100. For a particular column, I want to map the values as follows:

0-4 mapped to 0
5-9 mapped to 5
10-14 mapped to 10, and so on.

This is my code:

import numpy as np
@profile
def map_column(arr,col,incr):
    col_data = arr[:,col]
    vec = np.arange(0,100,incr)
    for i in range(col_data.shape[0]):
        for j in range(len(vec)-1):
            if (col_data[i]>=vec[j] and col_data[i]<vec[j+1]):
                col_data[i] = vec[j]
        if (col_data[i]>vec[-1]):
            col_data[i] = vec[-1]
    return col_data

np.random.seed(1)
myarr = np.random.randint(100,size=(80000,4))
x = map_column(myarr,2,5)

This code takes 8.3 seconds to run. The following is the output of running line_profiler on this code.

Timer unit: 1e-06 s
Total time: 8.32155 s
File: testcode2.py
Function: map_column at line 2
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           @profile
     3                                           def map_column(arr,col,incr):
     4         1         17.0     17.0      0.0      col_data = arr[:,col]
     5         1         34.0     34.0      0.0      vec = np.arange(0,100,incr)
     6     80001     139232.0      1.7      1.7      for i in range(col_data.shape[0]):
     7   1600000    2778636.0      1.7     33.4          for j in range(len(vec)-1):
     8   1520000    4965687.0      3.3     59.7              if (col_data[i]>=vec[j] and col_data[i]<vec[j+1]):
     9     76062     207492.0      2.7      2.5                  col_data[i] = vec[j]
    10     80000     221693.0      2.8      2.7          if (col_data[i]>vec[-1]):
    11      3156       8761.0      2.8      0.1              col_data[i] = vec[-1]
    12         1          2.0      2.0      0.0      return col_data

In the future I will have to work with real data that is much larger than this. Can anyone suggest a faster way to do this?


Solution

  • I think this can be solved with an arithmetic expression, if I understand the question correctly:

    def map_column(arr,col,incr):
        col_data = arr[:,col]
        return (col_data//incr)*incr
    

    should do the trick. Integer (floor) division discards the remainder, so multiplying by the increment again gives the largest multiple of the increment that is less than or equal to the original value. Values that are already multiples of the increment are left unchanged.
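One difference from the loop in the question may matter: the loop clamps anything above the last bin start (95 for incr=5) down to that value, whereas plain floor division would map 100 to 100. The following is a minimal sketch that keeps the clamp, assuming non-negative integers and an upper bound of 100. The `upper` parameter and the function names are illustrative and not part of the original post; the loop version is included only as a reference to compare against.

```python
import numpy as np

def map_column_loop(arr, col, incr):
    """Reference: the loop from the question, run on a copy of the column."""
    col_data = arr[:, col].copy()
    vec = np.arange(0, 100, incr)
    for i in range(col_data.shape[0]):
        for j in range(len(vec) - 1):
            if vec[j] <= col_data[i] < vec[j + 1]:
                col_data[i] = vec[j]
        if col_data[i] > vec[-1]:
            col_data[i] = vec[-1]
    return col_data

def map_column_vectorized(arr, col, incr, upper=100):
    """Floor each value to a multiple of incr, clamped to the last bin start.

    `upper` is a hypothetical parameter mirroring the hard-coded 100 in the
    question's np.arange(0, 100, incr). Assumes non-negative integers.
    """
    col_data = arr[:, col]
    # Last element of np.arange(0, upper, incr), computed without building it.
    top_bin = ((upper - 1) // incr) * incr
    # Floor division then multiplication snaps to the bin start;
    # np.minimum applies the same cap as the loop's final `if`.
    return np.minimum((col_data // incr) * incr, top_bin)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    myarr = rng.integers(0, 101, size=(1000, 4))  # includes the value 100
    same = np.array_equal(map_column_loop(myarr, 2, 5),
                          map_column_vectorized(myarr, 2, 5))
    print("loop and vectorized results agree:", same)
```

Note that the function in the question overwrites the column through a view of `arr`, while this sketch returns a new array. To reproduce the in-place behaviour, assign the result back with `arr[:, col] = map_column_vectorized(arr, col, incr)`.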