I have a pandas DataFrame of dimensions (m, n) that is filled with 0 and 1.
Treating each row of the DataFrame as a binary number, I would like to generate a pandas Series containing the base-10 integer that each row represents.
Given the following matrix of dimensions (m, n) filled with 0 and 1:
import numpy as np
import pandas as pd

m = int(1e6)
n = 5
df = pd.DataFrame(np.random.rand(m, n)).round().astype(int)
The method I use right now is this one:
df_asstr = df.astype(str)
bin_series = df_asstr.sum(axis=1).astype(int).astype(str)
def bin_to_int(strnum):
    return int(strnum, 2)
decimal_series = bin_series.astype(str).apply(bin_to_int)
My issue here is timing. If the DataFrame has length on the order of m=1e3, the whole process takes less than one second. However, when I run it with m=1e6, it takes about 22 seconds, and I need to run many of these, so I really want to speed it up.
I am aware that the steps slowing down the process are those involving conversion of the DataFrame to str, i.e. these lines:
df_asstr = df.astype(str)
bin_series = df_asstr.sum(axis=1).astype(int).astype(str)
decimal_series = bin_series.astype(str).apply(bin_to_int)
Does anyone know a more efficient way to create the series of base-10 integers? Thanks a lot!
You can use a dot product with powers of two generated by the bitwise left-shift operator:
a = df.values
b = a.dot(1 << np.arange(a.shape[-1] - 1, -1, -1))
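To see why this works, here is a minimal sketch on a small hand-written frame. The frame `small` and the name `weights` are illustrative and not part of the question. The shifted range yields the place values 16, 8, 4, 2, 1 (most significant bit first), so the dot product of each row with those weights is that row's value in base 10.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the question): three rows of five bits each.
small = pd.DataFrame([[0, 0, 0, 0, 1],
                      [1, 0, 1, 1, 0],
                      [1, 1, 1, 1, 1]])

a = small.values
# Place value of each column, most significant bit first: [16, 8, 4, 2, 1].
weights = 1 << np.arange(a.shape[-1] - 1, -1, -1)

# Each row's dot product with the weights is its value as a binary number.
result = pd.Series(a.dot(weights), index=small.index)
print(result.tolist())  # [1, 22, 31]
```

Wrapping the result in `pd.Series(..., index=small.index)` keeps the original row labels, which the plain NumPy array would otherwise lose.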
In [157]: %%timeit
...: a = df.values
...: b = pd.Series(a.dot(1 << np.arange(a.shape[-1] - 1, -1, -1)), index=df.index)
...:
16.8 ms ± 281 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [158]: %%timeit
...: (2 ** (np.arange(start = len(df.columns), stop = 0, step = -1)-1) * df).sum(axis =1)
...:
81.5 ms ± 432 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
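Before swapping one method for another, it may be worth confirming that they return the same values. The following is only a sketch under some assumptions: it uses a smaller random frame than the question (m = 1000) with a fixed seed, the names `by_string`, `by_dot`, and `by_power` are hypothetical labels, and the string-based reference is a plain Python loop rather than the exact pandas chain from the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
m, n = 1000, 5                  # smaller than in the question, to keep the check quick
df = pd.DataFrame(rng.integers(0, 2, size=(m, n)))

# Reference: build each row's bit string and parse it in base 2 (slow but simple).
by_string = [int("".join(map(str, row)), 2) for row in df.to_numpy()]

# Dot product with left-shifted place values.
a = df.values
by_dot = a.dot(1 << np.arange(a.shape[-1] - 1, -1, -1))

# Powers of two multiplied column-wise, then summed per row.
by_power = (2 ** (np.arange(start=len(df.columns), stop=0, step=-1) - 1) * df).sum(axis=1)

agree = np.array_equal(by_dot, by_string) and np.array_equal(by_dot, by_power.to_numpy())
print(agree)
```

If `agree` is true, the faster variants can replace the string-based one without changing the results, at least for inputs of this shape.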