Tags: python, performance, pandas, binary, timing

pandas series bits to integer in decimal base


I have a pandas DataFrame of shape (m, n) that is filled with 0s and 1s. Treating each row of the DataFrame as a binary number (most significant bit first), I would like to generate a pandas Series containing the base-10 integer represented by each row.

Given the following matrix of dimensions (m,n) filled with 0 and 1:

import numpy as np
import pandas as pd

m = int(1e6)
n = 5
df = pd.DataFrame(np.random.rand(m,n)).round().astype(int)

The method I use right now is this one:

df_asstr = df.astype(str)
bin_series = df_asstr.sum(axis=1).astype(int).astype(str)

def bin_to_int(strnum):
    return int(strnum, 2)

decimal_series = bin_series.astype(str).apply(bin_to_int)

My issue here is timing. If the DataFrame has on the order of m=1e3 rows, the whole process takes less than one second. However, when I run it with m=1e6, it takes about 22 seconds, and I need to run many of these, so I really want to speed it up.

I am aware that the steps slowing down the process are those involving conversion of the DataFrame to str, i.e. these lines:

df_asstr = df.astype(str)
bin_series = df_asstr.sum(axis=1).astype(int).astype(str)
decimal_series = bin_series.astype(str).apply(bin_to_int)

Does anyone know a more efficient way to create the series of integers in decimal base? Thanks a lot!


Solution

  • You can use a dot product with a vector of powers of two, built with the bitwise left-shift operator:

    a = df.values
    b = a.dot(1 << np.arange(a.shape[-1] - 1, -1, -1))
    

    In [157]: %%timeit 
         ...: a = df.values
         ...: b = pd.Series(a.dot(1 << np.arange(a.shape[-1] - 1, -1, -1)), index=df.index)
         ...: 
    16.8 ms ± 281 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [158]: %%timeit
         ...: (2 ** (np.arange(start = len(df.columns), stop = 0, step = -1)-1) * df).sum(axis =1)
         ...: 
    81.5 ms ± 432 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
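
To see why the dot product works, here is a minimal sketch on a small, made-up DataFrame (the three rows are illustrative, not taken from the question's data). Each column is multiplied by its place value 2**(n-1), ..., 2, 1 and the products are summed per row. The result is checked against the string-based `int(s, 2)` conversion. It assumes the leftmost column is the most significant bit and that n is below 63, so the result fits in a 64-bit integer.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; each row is read as a binary number,
# most significant bit first (leftmost column).
df = pd.DataFrame([[0, 0, 0, 0, 1],
                   [1, 0, 1, 1, 0],
                   [1, 1, 1, 1, 1]])

a = df.to_numpy()

# Place values 2**(n-1), ..., 2**1, 2**0, built with a left shift.
# The explicit int64 dtype assumes fewer than 63 columns.
weights = 1 << np.arange(a.shape[1] - 1, -1, -1, dtype=np.int64)

# One matrix-vector product converts every row at once.
fast = pd.Series(a.dot(weights), index=df.index)

# Reference result: join the digits of each row and parse them as base 2.
slow = [int("".join(str(bit) for bit in row), 2) for row in a]

print(fast.tolist())  # [1, 22, 31]
print(slow)           # [1, 22, 31]
```

The middle row, 10110, gives 16 + 4 + 2 = 22 both ways. The speed-up comes from replacing per-row string handling with a single vectorized NumPy operation.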