I'm using Python 3.7 to automate some processes that involve a DataFrame.
The problem I have is as follows.
I'm using this code:
data=pd.io.parsers.read_csv(basepath + files[0],sep='|',header=None,index_col=None,dtype={'2': 'str'},skiprows=2,usecols=[2,3,10,18,17,1])
The file is so huge it's impossible to track every value with a leading 00 by hand, and not all the numbers are 10 characters long; some are 9 characters, it depends.
I expect a result like this:
4 12345 abcd P1234 A1234
but some lines in column 2 start with 00, and the DataFrame automatically treats them as integers and strips the leading zeros, so where the result should be:
4 00123 abcd P1234 A1234
I end up with
4 123 abcd P1234 A1234
So I checked the pandas documentation and tried adding dtype, but it doesn't work for me. Any suggestions on how to make it work?
Your combination of header=None and dtype={'2': 'str'} is problematic. When pandas parses column headers it always uses the string representation. For a file like test.csv containing

1,2.0,2,7
1,2,03,03
1,00,3,01

we get

pd.read_csv('test.csv').columns
#Index(['1', '2.0', '2', '7'], dtype='object')
However, when specifying header=None, pandas instead creates an Int64Index:
pd.read_csv('test.csv', header=None).columns
#Int64Index([0, 1, 2, 3], dtype='int64')
So if you want the column with header '2' to be a string dtype, you need to remove header=None. If instead you just want the second column (counting from 0), you need to use the integer 2 as the key in dtype.
pd.read_csv('test.csv', header=None, dtype={2: 'str'})
# 0 1 2 3
#0 1 2.0 2 7
#1 1 2.0 03 3
#2 1 0.0 3 1
pd.read_csv('test.csv', dtype={'2': 'str'})
# 1 2.0 2 7 # <- first row is now used as string column headers
#0 1 2 03 3
#1 1 0 3 1
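Applied to your original pipe-delimited file, a minimal sketch might look like the following. The sample data here is made up to mimic your description (two rows to skip, then rows whose third field has leading zeros); the key point is that with header=None the dtype key must be the integer 2, not the string '2'.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real pipe-delimited file:
# two leading rows that get skipped, then data rows where the third
# field (column index 2) may have leading zeros.
raw = "skip|skip|skip|skip\nskip|skip|skip|skip\n4|x|00123|abcd\n4|y|12345|efgh\n"

# header=None -> integer column labels, so use the integer 2 as the dtype key.
df = pd.read_csv(io.StringIO(raw), sep='|', header=None,
                 skiprows=2, dtype={2: str})

print(df[2].tolist())  # -> ['00123', '12345']  leading zeros preserved
```

For the real file you would replace io.StringIO(raw) with basepath + files[0] and keep your original usecols argument.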