changing numpy array from type int64 to type int32 corrupts the data


I am using a very large dataset with pandas, and to reduce memory usage I cast all my columns from float64 to float32 and from int64 to int32. One of the columns is a timestamp in nanoseconds (on the order of 1594686594613248). Before the cast it has only positive values; after the cast it has mostly negative values. Is there some kind of bug in astype('int32')? What am I missing here?

Relevant code:

import pandas as pd

data_uid_label = pd.read_csv('label_to_uid.csv')
types = data_uid_label.dtypes

# Downcast every 64-bit column to its 32-bit counterpart
for name in data_uid_label.columns:
    if types[name] == 'float64':
        data_uid_label[name] = data_uid_label[name].astype('float32')
    elif types[name] == 'int64':
        data_uid_label[name] = data_uid_label[name].astype('int32')
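
For reference, a minimal repro of what I am seeing (the timestamp is the example value from above; the real column name is different):

import numpy as np

ts = np.array([1594686594613248], dtype=np.int64)  # positive nanosecond timestamp
print(ts.astype('int32'))  # prints [-402653184] on my machine -- negative!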

Thanks!


Solution

  • 1594686594613248 needs 51 bits to represent, so it fits in a 64-bit integer (int64) but not in a 32-bit one (int32). It overflows:

    • Every bit above the low 32 is truncated, i.e. thrown away, leaving a completely different (smaller) value.
    • Because integers are stored in two's complement, the new leftmost bit (the 32nd) becomes the sign bit, hence the negative results you are getting.

    Only cast a column to a smaller type when you are certain none of its values are too large to fit. Most of the time the small memory gain isn't even worth it unless you have millions of data points; see the sketch below.
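
For illustration, a short sketch of the wrap-around, plus a safer alternative. The DataFrame here is a hypothetical stand-in for the CSV in the question; pd.to_numeric with downcast= only narrows a column when every value actually fits:

import numpy as np
import pandas as pd

ts = 1594686594613248          # ts.bit_length() == 51, too wide for int32

# astype('int32') keeps only the low 32 bits. The highest surviving bit
# (bit 31) becomes the sign bit in two's complement, so the value turns negative.
wrapped = np.int64(ts).astype(np.int32)
low_32 = ts & 0xFFFFFFFF       # 3892314112, with bit 31 set
print(int(wrapped))            # -402653184 on typical platforms
print(low_32 - 2**32)          # -402653184 -- the same value, derived by hand

# Safer: let pandas pick the smallest integer type that actually fits.
df = pd.DataFrame({'timestamp_ns': [ts], 'label': [7]})
for name in df.columns:
    df[name] = pd.to_numeric(df[name], downcast='integer')
print(df.dtypes)               # timestamp_ns stays int64, label shrinks to int8

Note that downcast='integer' skips the narrowing entirely when a value like the timestamp would not survive it, which is exactly the guard the original loop was missing.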