why does dataframe keep reading file as integer while using dtype of string

i do use python 3.7 to automate some processes which include using dataframe

problem i got is as follow.

using this code:

data=pd.io.parsers.read_csv(basepath + files[0],sep='|',header=None,index_col=None,dtype={'2': 'str'},skiprows=2,usecols=[2,3,10,18,17,1])

The file is so huge its impossible to track every mistake with 00 , and not all number out there are 10 char long some are 9 char long it depends.

i expect result as follows:

4   12345   abcd   P1234   A1234

but some lines in column 2 are with 00 at start dataframe automaticaly thinks it's integer and get rid of it to be efficient so sometimes it should be:

4   00123   abcd   P1234   A1234

but i end up with

4   123   abcd   P1234   A1234

so i chcek documentation to pandas and tried adding dtype it doesn't work for me. Any suggestions how to make it work?

Solution

Your combination of header=None and dtype={'2': 'str'} are problematic. When pandas parses column headers it will always use the string representation. For a file like test.csv we get

1,2.0,2,7
1,2,03,03
1,00,3,01

pd.read_csv('test.csv').columns
#Index(['1', '2.0', '2', '7'], dtype='object')

However, when specifying header=None, pandas instead creates an Int64Index:

pd.read_csv('test.csv', header=None).columns
#Int64Index([0, 1, 2, 3], dtype='int64')

So if you want the column with header '2' to be a string dtype, then you need to remove header=None, or if you just want the second column (counting from 0) we need to use the integer 2 in dtype.

pd.read_csv('test.csv', header=None, dtype={2: 'str'})
#   0    1   2  3
#0  1  2.0   2  7
#1  1  2.0  03  3
#2  1  0.0   3  1

pd.read_csv('test.csv', dtype={'2': 'str'})
#   1  2.0   2  7   # <- This row now string column headers
#0  1    2  03  3
#1  1    0   3  1