Search code examples
pythonpandasnumpycsvieee-754

Reading decimal representation floats from a CSV with pandas


I am trying to read in the contents of a CSV file containing what I believe are IEEE 754 single precision floats, in decimal format.

By default, they are read in as int64. If I specify the data type with something like dtype = {'col1' : np.float32}, the dtype shows up correctly as float32, but they are just the same values as a float instead of an int, ie. 1079762502 becomes 1.079763e+09 instead of 3.435441493988037.

I have managed to do the conversion on single values with either of the following:

from struct import unpack

v = 1079762502

print(unpack('>f', v.to_bytes(4, byteorder="big")))
print(unpack('>f', bytes.fromhex(str(hex(v)).split('0x')[1])))

Which produces

(3.435441493988037,)
(3.435441493988037,)

However, I can't seem to implement this in a vectorised way with pandas:

import pandas as pd
from struct import unpack

df = pd.read_csv('experiments/test.csv')

print(df.dtypes)
print(df)

df['col1'] = unpack('>f', df['col1'].to_bytes(4, byteorder="big"))
#df['col1'] = unpack('>f', bytes.fromhex(str(hex(df['col1'])).split('0x')[1]))

print(df)

Throws the following error

col1    int64
dtype: object
         col1
0  1079762502
1  1079345162
2  1078565306
3  1078738012
4  1078635652

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-c06d0986cc96> in <module>
      7 print(df)
      8 
----> 9 df['col1'] = unpack('>f', df['col1'].to_bytes(4, byteorder="big"))
     10 #df['col1'] = unpack('>f', bytes.fromhex(str(hex(df['col1'])).split('0x')[1]))
     11 

~/anaconda3/envs/test/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5177             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5178                 return self[name]
-> 5179             return object.__getattribute__(self, name)
   5180 
   5181     def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'to_bytes'

Or if I try the second way, TypeError: 'Series' object cannot be interpreted as an integer

I am at the limits of my Python knowledge here, I suppose I could iterate through every single row, cast to hex, then to string, then strip the 0x, unpack and store. But that seems very convoluted, and already takes several seconds on smaller sample datasets, let along for hundreds of thousands of entries. Am I missing something simple here, is there any better way of doing this?


Solution

  • CSV is a text format, IEEE 754 single precision floats are binary numeric format. If you have a CSV, you have text, it is not that format at all. If I understand you correctly, I think you mean you have text which represent integers (in decimal format) that correspond to a 32bit integer interpretation of your 32bit floats.

    So, for starters, when you read the data from a csv, pandas used 64 bit integers by default. So convert to 32bit integers, then re-interpret the bytes using .view:

    In [8]: df
    Out[8]:
             col1
    0  1079762502
    1  1079345162
    2  1078565306
    3  1078738012
    4  1078635652
    
    In [9]: df.col1.astype(np.int32).view('f')
    Out[9]:
    0    3.435441
    1    3.335940
    2    3.150008
    3    3.191184
    4    3.166780
    Name: col1, dtype: float32
    

    Decomposed into steps to help understand:

    In [10]: import numpy as np
    
    In [11]: arr = df.col1.values
    
    In [12]: arr
    Out[12]: array([1079762502, 1079345162, 1078565306, 1078738012, 1078635652])
    
    In [13]: arr.dtype
    Out[13]: dtype('int64')
    
    In [14]: arr_32 = arr.astype(np.int32)
    
    In [15]: arr_32
    Out[15]:
    array([1079762502, 1079345162, 1078565306, 1078738012, 1078635652],
          dtype=int32)
    
    In [16]: arr_32.view('f')
    Out[16]:
    array([3.4354415, 3.33594  , 3.1500077, 3.191184 , 3.1667795],
          dtype=float32)