python pandas numpy formatting

Pandas Split Scientific Notation into two Columns - Significand and Exponent


I have a column in a data frame called MARKET_VALUE that I need to pass to a downstream system in a defined format. MARKET_VALUE, a float, needs to be passed as two integer columns (a significand with no trailing zeros, and an exponent, exp) as follows:

MARKET_VALUE    SIGNIFICAND    EXP
6.898806e+09        6898806      3
6.898806e+05        6898806     -1
6.898806e+03        6898806     -3

I contemplated using formatted strings but am convinced there must be a smarter solution. The data frame is large, containing millions of rows, so a solution that doesn't depend on apply would be preferable.


Solution

  • Generate a random pandas dataframe

    I use a DataFrame consisting of 1e5 rows (you could try more rows to find the bottleneck).

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame(np.random.random((100000, 2))**10, columns=['random1', 'random2'])
    

  • Use the .apply method

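    In this case I use standard Python string formatting.

    • .8E formats the number in exponent notation with 8 digits after the decimal point.
    • [:-4] removes the exponent suffix (for example E-03) and keeps only the significand.
    • [-3:] keeps only the signed exponent, which is then converted to an int.

    Both slices assume a two-digit exponent, i.e. values whose exponent lies between -99 and 99.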
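    # get the significand digits as a string (trailing zeros are not stripped here)
    df.random1.apply(lambda x: f'{x:.8E}'[:-4].replace('.', ''))

    # get the exp; use the same .8E format so both parts share one rounding
    # (with .0E, a value such as 9.7e-03 rounds up to 1E-02)
    df.random1.apply(lambda x: int(f'{x:.8E}'[-3:]))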
    

    On my laptop it took less than 100ms.
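The formatting approach above gives a nine-digit significand string and the exponent of the leading digit. The question asks for an integer significand with no trailing zeros, which also shifts the exponent. Below is a minimal sketch of that last step, assuming finite inputs (NaN and inf are not handled). The names `split_sci` and `decimals` are hypothetical and not part of the original answer.

```python
def split_sci(x, decimals=8):
    """Return (significand, exp) as ints so that significand * 10**exp is x
    rounded to decimals + 1 significant digits. Assumes x is finite."""
    # Splitting on 'E' also works when the exponent has three digits.
    mantissa, exp = f'{x:.{decimals}E}'.split('E')
    sign = -1 if mantissa.startswith('-') else 1
    # Drop the sign and the decimal point, then strip trailing zeros.
    digits = mantissa.lstrip('-').replace('.', '').rstrip('0')
    if not digits:  # x is zero: every digit was stripped
        return 0, 0
    # One digit sat before the decimal point; the rest move into the exponent.
    return sign * int(digits), int(exp) - (len(digits) - 1)


# The three rows from the question
for value in (6.898806e+09, 6.898806e+05, 6.898806e+03):
    print(value, split_sci(value))
```

With a DataFrame, this could be used as `df['random1'].apply(split_sci)`, which returns one tuple per row. It still relies on apply, so it is no faster than the lambdas above.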

    I am thinking about a faster (vectorized) solution, but for now I hope this helps.
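One possible vectorized variant, sketched with NumPy only and not benchmarked here: scale each value to an integer with a fixed number of significant digits, then strip trailing zeros with integer arithmetic. The names `split_sci_vectorized` and `sig_digits` are hypothetical. The sketch assumes finite values in the normal float64 range (no NaN, inf, or subnormals) and rounds to about `sig_digits` significant digits.

```python
import numpy as np


def split_sci_vectorized(values, sig_digits=9):
    """Return (significand, exp) int64 arrays with values ~ significand * 10**exp.
    Assumes finite float64 inputs; zeros map to (0, 0)."""
    x = np.asarray(values, dtype=np.float64)
    nonzero = x != 0
    # Replace zeros by 1.0 so that log10 is defined everywhere.
    mag = np.where(nonzero, np.abs(x), 1.0)
    # Decimal exponent of the leading digit.
    exp10 = np.floor(np.log10(mag)).astype(np.int64)
    # Scale so that roughly sig_digits digits sit before the decimal point.
    shift = sig_digits - 1 - exp10
    sig = np.rint(mag * np.power(10.0, shift)).astype(np.int64)
    exp = -shift
    # Strip one trailing zero per pass; sig_digits passes are always enough.
    for _ in range(sig_digits):
        strip = nonzero & (sig % 10 == 0)
        if not strip.any():
            break
        sig = np.where(strip, sig // 10, sig)
        exp = exp + strip
    sig = np.where(nonzero, np.sign(x).astype(np.int64) * sig, 0)
    exp = np.where(nonzero, exp, 0)
    return sig, exp


sig, exp = split_sci_vectorized([6.898806e+09, 6.898806e+05, 6.898806e+03])
print(sig, exp)
```

For a DataFrame column, the two arrays could be assigned directly, for example `df['SIGNIFICAND'], df['EXP'] = split_sci_vectorized(df['MARKET_VALUE'])`. The loop runs at most `sig_digits` times regardless of the number of rows, so the work per pass stays inside NumPy.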