Tags: python, pandas, floating-point, truncate

Can pandas truncate my data and cause irreparable data loss without any kind of warning whatsoever?


import pandas as pd
import io

indata = io.StringIO("c\n10000000000")

df = pd.read_csv(indata, header=0)
print(df)

indata.seek(0)

df = pd.read_csv(indata, header=0, dtype={"c":int})
print(df)

Expected Output:

             c
0  10000000000
            c
0  10000000000

Actual Output:

             c
0  10000000000
            c
0  1410065408

Can pandas truncate my data this way with no warning whatsoever?

I was banging my head trying to figure out why my script didn't work (this is a toy example, of course; my real script is more complicated). After 45 minutes of desperation, during which I also tried to work out which data type pandas had assigned to my columns, I discovered the behaviour above.

I set the dtype in my real script because pandas kept loading that column as a float, but I needed it as an int to make comparisons.

EDIT: Additional information as requested in the comments:

Python version

Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32

Pandas version: 1.1.3

Platform:

>>> platform.platform()
'Windows-10-10.0.18362-SP0'
>>> platform.processor()
'Intel64 Family 6 Model 158 Stepping 10, GenuineIntel'
>>> platform.version()
'10.0.18362'

Solution

  • I see what's happening here. From the pandas documentation:

    dtype: Type name or dict of column -> type, optional. Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.

    So the documentation states, in CAPITAL LETTERS, that converters are applied INSTEAD of dtype conversion; when you pass no converters, read_csv() converts the column to the dtype you specify. Passing int is therefore like explicitly telling it to use the NumPy equivalent of Python's int. That is why there are no warnings, and it should be considered expected behavior.
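Since the dtype you pass is taken at face value, one way to avoid depending on the platform is to name the width explicitly. The following is a minimal sketch, not the only possible fix: it assumes a pandas version that accepts the "int64" and nullable "Int64" dtype names in read_csv (both appear in the documentation quoted above), and the two-column CSV text is a made-up example.

```python
import io

import pandas as pd

# Fixed-width 64-bit integers: the width no longer depends on the platform.
df_fixed = pd.read_csv(io.StringIO("c\n10000000000"), dtype={"c": "int64"})
print(df_fixed["c"].dtype, df_fixed["c"].iloc[0])

# Hypothetical data with a missing value in column "c".
csv_text = "a,c\n1,10000000000\n2,\n"

# Nullable 64-bit integers: the empty cell becomes <NA> instead of
# forcing the whole column to float.
df_nullable = pd.read_csv(io.StringIO(csv_text), dtype={"c": "Int64"})
print(df_nullable["c"].dtype)
print(df_nullable)
```

The nullable "Int64" variant may be relevant if the column was being read as float because of missing values, which is only a guess about the real data.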


    Now the question is: why is the NumPy equivalent of int on your machine int32 instead of int64?

    The NumPy documentation maps Python's int to the built-in scalar np.int_, with a warning attached:

    [Screenshot of the warning in the NumPy documentation; image not available]

    The NumPy documentation specifies that the built-in scalar np.int_ is platform dependent:

    [Screenshot of the np.int_ entry in the NumPy documentation; image not available]

    TL;DR: int (Python) -> int_ (NumPy) -> long (C)

    So, the question is what does long mean for your system?

    For MSC, long is 4 bytes, as shown in the docs:

    [Screenshot of the MSC documentation for long; image not available]

    and confirmed by NumPy:

    [Screenshot of the NumPy confirmation; image not available]
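With a 4-byte integer, the number from the question does not fit, and only its low 32 bits survive. The sketch below reproduces the 1410065408 from the question with plain NumPy. It illustrates the same wraparound arithmetic; it is not the code path pandas itself takes.

```python
import numpy as np

value = 10_000_000_000

# A 32-bit signed integer can only hold values in [-2**31, 2**31 - 1].
info = np.iinfo(np.int32)
print(info.min, info.max)

# Casting a 64-bit array down to 32 bits keeps the low 32 bits, silently.
wrapped = np.array([value], dtype=np.int64).astype(np.int32)[0]
print(wrapped)

# The same number by hand: reduce modulo 2**32. The remainder is below
# 2**31 here, so no sign adjustment is needed.
by_hand = value % 2**32
print(by_hand)
```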

    For GCC on 64-bit Linux, long is 8 bytes, as confirmed here:

    [Screenshot confirming the size of long under GCC; image not available]
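To see which case applies on a given machine, you can ask Python directly. This is a small sketch using ctypes from the standard library and np.dtype(int); the variable names are just for illustration. Note that on NumPy 2.x the default integer is 64-bit on 64-bit Windows, so there the two sizes may differ, unlike with the NumPy versions contemporary with pandas 1.1.3.

```python
import ctypes

import numpy as np

# Size of C long as seen by the compiler that built this Python.
c_long_bytes = ctypes.sizeof(ctypes.c_long)

# The dtype NumPy picks when you pass Python's int as a dtype.
default_int = np.dtype(int)

print("C long:", c_long_bytes, "bytes")
print("np.dtype(int):", default_int.name, default_int.itemsize, "bytes")
```

If the second line reports int32, passing dtype=int to read_csv would be expected to give a 32-bit column on that machine.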


    Hope this was useful, and you learned something new. 🙂