Search code examples
pythoncsvpandasdataframedouble-quotes

How to remove double quotes from index of csv file in python


I'm trying to read multiple csv files with python. The index of raw data(or the first column) has a little problem, the partial csv file looks like this:

NoDemande;"NoUsager";"Sens";"IdVehiculeUtilise";"NoConducteur";"NoAdresse";"Fait";"HeurePrevue"
42210000003;"42210000529";"+";"265Véh";"42210000032";"42210002932";"1";"25/07/2015 10:00:04"
42210000005;"42210001805";"+";"265Véh";"42210000032";"42210002932";"1";"25/07/2015 10:00:04"
42210000004;"42210002678";"+";"265Véh";"42210000032";"42210002932";"1";"25/07/2015 10:00:04"
42210000003;"42210000529";"—";"265Véh";"42210000032";"42210004900";"1";"25/07/2015 10:50:03"
42210000004;"42210002678";"—";"265Véh";"42210000032";"42210007072";"1";"25/07/2015 11:25:03"
42210000005;"42210001805";"—";"265Véh";"42210000032";"42210004236";"1";"25/07/2015 11:40:03"

The first index has no "", after reading the file, it looks like: "NoDemande" while others have no "", and the rest of column looks just fine, which makes the result looks like(not the same lines):

"NoDemande"     NoUsager Sens IdVehiculeUtilise NoConducteur    NoAdresse Fait          HeurePrevue
42209000003  42209001975    +            245Véh  42209000002  42209005712    1   24/07/2015 06:30:04
42209000004  42209002021    +            245Véh  42209000002  42209005712    1   24/07/2015 06:30:04
42209000005  42209002208    +            245Véh  42209000002  42209005713    1   24/07/2015 06:45:04
42216000357  42216001501    -            190Véh  42216000139  42216001418    1   31/07/2015 17:15:03
42216000139  42216000788    -         309V7pVéh  42216000059  42216006210    1   31/07/2015 17:15:03
42216000118  42216000188    -            198Véh  42216000051  42216006374    1   31/07/2015 17:15:03

It causes problem identifying name of index in the coming moves. How to solve this problem? Here's my code of reading files:

import pandas as pd
import glob

pd.set_option('expand_frame_repr', False)
path = r'D:\Python27\mypfe\data_test'
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []

for file_ in allFiles:
    #Read file
    df = pd.read_csv(file_,header=0,sep=';',dayfirst=True,encoding='utf8',
                     dtype='str')

    df['Sens'].replace(u'\u2014','-',inplace=True)

    list_.append(df)
    print"fichier lu ",file_

frame = pd.concat(list_)
print frame

Solution

  • I think the simpliest is set new column names:

    df.columns = ['NoDemande1'] + df.columns[1:].tolist()
    print (df)
        NoDemande1     NoUsager Sens IdVehiculeUtilise  NoConducteur    NoAdresse  \
    0  42210000003  42210000529    +            265Véh   42210000032  42210002932   
    1  42210000005  42210001805    +            265Véh   42210000032  42210002932   
    2  42210000004  42210002678    +            265Véh   42210000032  42210002932   
    3  42210000003  42210000529    -           265Véh   42210000032  42210004900   
    4  42210000004  42210002678    -           265Véh   42210000032  42210007072   
    5  42210000005  42210001805    -           265Véh   42210000032  42210004236   
    
       Fait          HeurePrevue  
    0     1  25/07/2015;10:00:04  
    1     1  25/07/2015;10:00:04  
    2     1  25/07/2015;10:00:04  
    3     1  25/07/2015;10:50:03  
    4     1  25/07/2015;11:25:03  
    5     1  25/07/2015;11:40:03  
    

    Another solution is strip values " from column names:

    print (df)
       "NoDemande"     NoUsager Sens IdVehiculeUtilise  NoConducteur    NoAdresse  \
    0  42210000003  42210000529    +            265Véh   42210000032  42210002932   
    1  42210000005  42210001805    +            265Véh   42210000032  42210002932   
    2  42210000004  42210002678    +            265Véh   42210000032  42210002932   
    3  42210000003  42210000529    -           265Véh   42210000032  42210004900   
    4  42210000004  42210002678    -           265Véh   42210000032  42210007072   
    5  42210000005  42210001805    -           265Véh   42210000032  42210004236   
    
       Fait          HeurePrevue  
    0     1  25/07/2015;10:00:04  
    1     1  25/07/2015;10:00:04  
    2     1  25/07/2015;10:00:04  
    3     1  25/07/2015;10:50:03  
    4     1  25/07/2015;11:25:03  
    5     1  25/07/2015;11:40:03
    
    df.columns = df.columns.str.strip('"')
    print (df)
         NoDemande     NoUsager Sens IdVehiculeUtilise  NoConducteur    NoAdresse  \
    0  42210000003  42210000529    +            265Véh   42210000032  42210002932   
    1  42210000005  42210001805    +            265Véh   42210000032  42210002932   
    2  42210000004  42210002678    +            265Véh   42210000032  42210002932   
    3  42210000003  42210000529    -            265Véh   42210000032  42210004900   
    4  42210000004  42210002678    -            265Véh   42210000032  42210007072   
    5  42210000005  42210001805    -            265Véh   42210000032  42210004236   
    
       Fait          HeurePrevue  
    0     1  25/07/2015;10:00:04  
    1     1  25/07/2015;10:00:04  
    2     1  25/07/2015;10:00:04  
    3     1  25/07/2015;10:50:03  
    4     1  25/07/2015;11:25:03  
    5     1  25/07/2015;11:40:03