Search code examples
pythonpandasdataframepython-3.7

why does dataframe keep reading file as integer while using dtype of string


i do use python 3.7 to automate some processes which include using dataframe

problem i got is as follow.

using this code:

data=pd.io.parsers.read_csv(basepath + files[0],sep='|',header=None,index_col=None,dtype={'2': 'str'},skiprows=2,usecols=[2,3,10,18,17,1])

The file is so huge its impossible to track every mistake with 00 , and not all number out there are 10 char long some are 9 char long it depends.

i expect result as follows:

4   12345   abcd   P1234   A1234

but some lines in column 2 are with 00 at start dataframe automaticaly thinks it's integer and get rid of it to be efficient so sometimes it should be:

4   00123   abcd   P1234   A1234

but i end up with

4   123   abcd   P1234   A1234

so i chcek documentation to pandas and tried adding dtype it doesn't work for me. Any suggestions how to make it work?


Solution

  • Your combination of header=None and dtype={'2': 'str'} are problematic. When pandas parses column headers it will always use the string representation. For a file like test.csv we get

    1,2.0,2,7
    1,2,03,03
    1,00,3,01
    
    pd.read_csv('test.csv').columns
    #Index(['1', '2.0', '2', '7'], dtype='object')
    

    However, when specifying header=None, pandas instead creates an Int64Index:

    pd.read_csv('test.csv', header=None).columns
    #Int64Index([0, 1, 2, 3], dtype='int64')
    

    So if you want the column with header '2' to be a string dtype, then you need to remove header=None, or if you just want the second column (counting from 0) we need to use the integer 2 in dtype.

    pd.read_csv('test.csv', header=None, dtype={2: 'str'})
    #   0    1   2  3
    #0  1  2.0   2  7
    #1  1  2.0  03  3
    #2  1  0.0   3  1
    
    pd.read_csv('test.csv', dtype={'2': 'str'})
    #   1  2.0   2  7   # <- This row now string column headers
    #0  1    2  03  3
    #1  1    0   3  1