I am using a very large dataset with pandas, and to reduce memory use I cast all my columns from float64 to float32 and from int64 to int32. One of the columns is a timestamp in nanoseconds (something like 1594686594613248). Before the cast it contains only positive values; after the cast, it is mostly negative. Is there some kind of bug in astype('int32')? What am I missing here?
Relevant code:
import pandas as pd

data_uid_label = pd.read_csv('label_to_uid.csv')
types = data_uid_label.dtypes
for name in data_uid_label.columns:
    if types[name] == 'float64':
        data_uid_label[name] = data_uid_label[name].astype('float32')
    if types[name] == 'int64':
        data_uid_label[name] = data_uid_label[name].astype('int32')
Thanks!
1594686594613248 needs 51 bits to be represented, so it fits in a 64-bit integer (int64) but not in a 32-bit one (int32). astype('int32') does not raise on overflow; NumPy silently keeps only the low 32 bits, which is why your values come out negative:
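A minimal demonstration of the wrap-around (my own sketch, not from the original post):

import numpy as np

ts = np.array([1594686594613248], dtype=np.int64)

# int32 tops out at 2**31 - 1 = 2147483647, far below the timestamp
print(np.iinfo(np.int32).max)  # 2147483647

# Only the low 32 bits survive: 1594686594613248 % 2**32 = 3892314112,
# which read as a signed 32-bit value is 3892314112 - 2**32 = -402653184
print(ts.astype(np.int32))     # [-402653184]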
Only cast a column to a smaller type when you are certain none of its values is too large for that type. Most of the time, the minimal memory gain isn't even worth it unless you have millions of data points.
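If you want that check done for you, a safer pattern (my suggestion, not part of the original answer) is pd.to_numeric with its downcast argument, which only narrows a column to the smallest dtype that can still represent every value, so the timestamp column would stay int64:

import pandas as pd

data_uid_label = pd.read_csv('label_to_uid.csv')

for name in data_uid_label.columns:
    col = data_uid_label[name]
    if col.dtype == 'float64':
        # Downcasts to float32 only when the values survive the cast
        data_uid_label[name] = pd.to_numeric(col, downcast='float')
    elif col.dtype == 'int64':
        # Picks the smallest signed int dtype that holds min/max, so a
        # nanosecond timestamp like 1594686594613248 remains int64
        data_uid_label[name] = pd.to_numeric(col, downcast='integer')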