I am trying to read in the contents of a CSV file containing what I believe are IEEE 754 single precision floats, in decimal format.
By default, they are read in as int64. If I specify the data type with something like dtype = {'col1' : np.float32}
, the dtype shows up correctly as float32, but they are just the same values as a float instead of an int, ie. 1079762502
becomes 1.079763e+09
instead of 3.435441493988037
.
I have managed to do the conversion on single values with either of the following:
from struct import unpack
v = 1079762502
print(unpack('>f', v.to_bytes(4, byteorder="big")))
print(unpack('>f', bytes.fromhex(str(hex(v)).split('0x')[1])))
Which produces
(3.435441493988037,)
(3.435441493988037,)
However, I can't seem to implement this in a vectorised way with pandas:
import pandas as pd
from struct import unpack
df = pd.read_csv('experiments/test.csv')
print(df.dtypes)
print(df)
df['col1'] = unpack('>f', df['col1'].to_bytes(4, byteorder="big"))
#df['col1'] = unpack('>f', bytes.fromhex(str(hex(df['col1'])).split('0x')[1]))
print(df)
Throws the following error
col1 int64
dtype: object
col1
0 1079762502
1 1079345162
2 1078565306
3 1078738012
4 1078635652
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-8-c06d0986cc96> in <module>
7 print(df)
8
----> 9 df['col1'] = unpack('>f', df['col1'].to_bytes(4, byteorder="big"))
10 #df['col1'] = unpack('>f', bytes.fromhex(str(hex(df['col1'])).split('0x')[1]))
11
~/anaconda3/envs/test/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
5177 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5178 return self[name]
-> 5179 return object.__getattribute__(self, name)
5180
5181 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'to_bytes'
Or if I try the second way, TypeError: 'Series' object cannot be interpreted as an integer
I am at the limits of my Python knowledge here, I suppose I could iterate through every single row, cast to hex, then to string, then strip the 0x, unpack and store. But that seems very convoluted, and already takes several seconds on smaller sample datasets, let along for hundreds of thousands of entries. Am I missing something simple here, is there any better way of doing this?
CSV is a text format, IEEE 754 single precision floats are binary numeric format. If you have a CSV, you have text, it is not that format at all. If I understand you correctly, I think you mean you have text which represent integers (in decimal format) that correspond to a 32bit integer interpretation of your 32bit floats.
So, for starters, when you read the data from a csv, pandas
used 64 bit integers by default. So convert to 32bit integers, then re-interpret the bytes using .view
:
In [8]: df
Out[8]:
col1
0 1079762502
1 1079345162
2 1078565306
3 1078738012
4 1078635652
In [9]: df.col1.astype(np.int32).view('f')
Out[9]:
0 3.435441
1 3.335940
2 3.150008
3 3.191184
4 3.166780
Name: col1, dtype: float32
Decomposed into steps to help understand:
In [10]: import numpy as np
In [11]: arr = df.col1.values
In [12]: arr
Out[12]: array([1079762502, 1079345162, 1078565306, 1078738012, 1078635652])
In [13]: arr.dtype
Out[13]: dtype('int64')
In [14]: arr_32 = arr.astype(np.int32)
In [15]: arr_32
Out[15]:
array([1079762502, 1079345162, 1078565306, 1078738012, 1078635652],
dtype=int32)
In [16]: arr_32.view('f')
Out[16]:
array([3.4354415, 3.33594 , 3.1500077, 3.191184 , 3.1667795],
dtype=float32)