import pandas as pd
import io
indata = io.StringIO("c\n10000000000")
df = pd.read_csv(indata, header=0)
print(df)
indata.seek(0)
df = pd.read_csv(indata, header=0, dtype={"c":int})
print(df)
Expected Output:
c
0 10000000000
c
0 10000000000
Actual Output:
c
0 10000000000
c
0 1410065408
Can pandas truncate my data this way with no warning whatsoever?
I was banging my head trying to figure out why my script didn't work (this is a toy example, of course; my real script is more complicated). After 45 minutes of desperation, which included trying to work out which data type pandas had assigned to my columns, I finally discovered the behaviour above.
I set the dtype in my real script because pandas kept loading that column as a float, but I needed it as an int to make comparisons.
EDIT: Additional information as requested in the comments:
Python version
Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Pandas version: 1.1.3
Platform:
>>> platform.platform()
'Windows-10-10.0.18362-SP0'
>>> platform.processor()
'Intel64 Family 6 Model 158 Stepping 10, GenuineIntel'
>>> platform.version()
'10.0.18362'
I see what's happening here. From the pandas documentation:
dtype: Type name or dict of column -> type, optional. Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
So, it is mentioned in CAPITAL LETTERS that read_csv() will apply the dtype conversion if you specify one. Passing int is therefore like explicitly telling it to use the numpy equivalent of int. That is why there are no warnings, and it should be considered expected behavior.
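To see where 1410065408 comes from, here is a minimal sketch. It assumes that int resolved to a 32-bit integer on the asker's machine; the variable names are mine, chosen for illustration. Casting to a 32-bit integer keeps only the low 32 bits of the value:

```python
import numpy as np

value = 10_000_000_000  # the value from the CSV; it needs more than 32 bits

# Casting an int64 array to int32 keeps only the low 32 bits, with no warning.
wrapped = np.array([value], dtype=np.int64).astype(np.int32)[0]
print(wrapped)  # 1410065408

# The same result using plain integer arithmetic: the value modulo 2**32.
print(value % 2**32)  # 1410065408
```

The result matches the "Actual Output" in the question, which suggests the column was cast to a 32-bit integer.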
Now, the question is: why is my numpy equivalent of int int32 instead of int64?
The numpy documentation maps Python's int to the built-in scalar np.int_, and specifies that np.int_ is platform dependent.
TL;DR int(python) -> int_(numpy) -> long(C)
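You can check this mapping on your own machine. The sketch below prints the numpy dtype that the builtin int resolves to, and then reads the same CSV with the width named explicitly as "int64", which is one way to avoid depending on the platform default. Note that the printed dtype may differ from the asker's setup: as far as I know, NumPy 2.0 and later use a 64-bit default integer on 64-bit Windows.

```python
import io

import numpy as np
import pandas as pd

# Which numpy dtype does the Python builtin `int` resolve to on this machine?
print(np.dtype(int), np.dtype(np.int_).itemsize, "bytes")

# Naming the width explicitly avoids depending on the platform's default.
indata = io.StringIO("c\n10000000000")
df = pd.read_csv(indata, header=0, dtype={"c": "int64"})
print(df["c"].dtype, df["c"].iloc[0])
```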
So, the question is: what does long mean for your system?
For MSC, long is 4 bytes, as shown in the docs and confirmed by numpy.
For GCC (on typical 64-bit Linux targets), long is 8 bytes.
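If you want to check the size of long on your own system without writing any C, a small sketch using the standard-library ctypes module should do. It reports the size for the compiler that built your Python interpreter, which I am assuming is the one that matters here:

```python
import ctypes

# Size in bytes of the C `long` type for the compiler that built this Python.
long_size = ctypes.sizeof(ctypes.c_long)
print(f"C long is {long_size} bytes ({long_size * 8} bits)")
```

On a Windows build like the one in the question I would expect this to report 4 bytes, and on 64-bit Linux 8 bytes.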
Hope this was useful, and you learned something new. 🙂