Tags: python, pandas, numpy, normalization, stochastic

Integer matrix to stochastic matrix normalization


Suppose I have a matrix with integer values. I want to turn it into a stochastic matrix (i.e. a matrix in which each row sums to 1).

I create a random matrix, compute the sum of each row, and divide each element in a row by that row's sum.

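import numpy as np
import pandas as pd

dt = pd.DataFrame(np.random.randint(0,10000,size=10000).reshape(100,100))
dt['sum_row'] = dt.sum(axis=1)
# divide every original column by the row sums (the last column, 'sum_row', is skipped)
for col_n in dt.columns[:-1]:
    dt[col_n] = dt[col_n] / dt['sum_row']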

After this, the sum of each row should be equal to 1, but it is not.

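# row sums over the normalized columns only (this step is implied by the check below)
dt['sum_row_normalized'] = dt.drop(columns='sum_row').sum(axis=1)
(dt.sum_row_normalized == 1).value_counts()
> False    75
> True     25
> Name: sum_row_normalized, dtype: int64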

I understand that some of the sums are not exactly 1 but very close to it. Nevertheless, how can I normalize the matrix correctly?


Solution

  • You can't guarantee that the floating-point row sums will be exactly 1, but you can check that they equal 1 to an arbitrary precision by rounding them with np.around.

    This is probably easier/faster without looping through pandas columns.

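    import numpy as np

    X = np.random.randint(0,10000,size=10000).reshape(100,100)
    X_float = X.astype(float)
    # divide each row by its row sum (reshaped to a column so it broadcasts across the row)
    Y = X_float/X_float.sum(axis=1)[:,np.newaxis]

    # count the rows whose sum equals 1 after rounding to 10 decimal places
    sum(np.around(Y.sum(axis=1),decimals=10)==1) # is 100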
    

    (You don't need the .astype(float) step in Python 3.x, where / performs true division even on integer arrays.)
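
If you would rather stay in pandas, as in the question, the same idea can be sketched with DataFrame.div and a tolerance-based comparison. This is only a minimal sketch, not part of the answer above. The names rng, stochastic, row_sums, all_rows_ok and n_rows_ok are made up for illustration, and the random values start at 1 (an assumption) so that no row can sum to zero.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: a 100x100 integer matrix like the one in the question.
# Values start at 1 (an assumption made here) so that no row sum can be zero.
rng = np.random.default_rng(0)
dt = pd.DataFrame(rng.integers(1, 10000, size=(100, 100)))

# Divide every row by its own sum; axis=0 aligns the row sums with the row index.
stochastic = dt.div(dt.sum(axis=1), axis=0)

# Compare against 1 with a tolerance instead of testing for exact equality.
row_sums = stochastic.sum(axis=1)
all_rows_ok = bool(np.allclose(row_sums, 1.0))    # True if every row sums to ~1
n_rows_ok = int(np.isclose(row_sums, 1.0).sum())  # number of rows that sum to ~1
print(all_rows_ok, n_rows_ok)
```

np.isclose and np.allclose compare within a relative and absolute tolerance (rtol and atol), so they play the same role as rounding with np.around but without having to pick a number of decimals.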