I have an Excel .csv file which looks like this:
id,ticker1,ticker2,Company,Supplier
9,MMM,DHR,3M Company,Danaher Corporation
10,MMM,IBM,3M Company,International Business Machines Corporation
11,MMM,UNP,3M Company,Union Pacific Corporation
12,MMM,TMO,3M Company,Thermo Fisher Scientific Inc.
13,MMM,NSC,3M Company,Norfolk Southern Corporation
61,No,0,,
14,MMM,TER,3M Company,"Teradyne, Inc."
The data here is comma separated, so I use delimiter=',', but in some cases, as in the last line, there is a comma in the text too. In my csv file all strings with a comma in them are in quotation marks " ". I have tried to read the file with open():
with open('file.csv', 'r') as f:
    for row in f.read().split('\n'):
        # naive split: also breaks on commas inside quoted fields
        fields = row.split(',')
But this way I have to handle commas and empty values myself. So I am curious: what is the correct way of working with damaged csv files? Should I use regular expressions or libs like pandas?
You should use Python's csv module. It is designed for exactly this purpose: it takes away the complexity of things like parsing CSV data where the delimiter can appear inside fields.
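To make the difference concrete, here is a minimal sketch that parses the last row from the question in two ways: a plain split(',') and csv.reader. The variable names are mine, and the row is fed in through io.StringIO only so the snippet runs without a file on disk.

```python
import csv
import io

# The last data row from the question, with a comma inside a quoted field.
line = '14,MMM,TER,3M Company,"Teradyne, Inc."'

# Naive splitting treats the comma inside the quotes as a delimiter,
# so the supplier name is cut into two pieces.
naive_fields = line.split(',')

# csv.reader honours the quotation marks and keeps the field intact.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

The naive split yields six pieces, while csv.reader yields the expected five fields with the quotes already removed.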
Here is a simple example. I use the csv module's DictReader, which reads the first line of the CSV as key names and then maps each subsequent row to those keys. It automatically takes care of the comma inside the Supplier field.
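import csv

# newline="" is what the csv docs recommend when opening files for the csv module
with open("sample.csv", newline="") as my_csv:
    # DictReader uses the header row as the keys for every following row
    for row in csv.DictReader(my_csv):
        print(f'id {row["id"]} has supplier "{row["Supplier"]}"')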
OUTPUT
id 9 has supplier "Danaher Corporation"
id 10 has supplier "International Business Machines Corporation"
id 11 has supplier "Union Pacific Corporation"
id 12 has supplier "Thermo Fisher Scientific Inc."
id 13 has supplier "Norfolk Southern Corporation"
id 61 has supplier ""
id 14 has supplier "Teradyne, Inc."
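The question also mentions empty values. DictReader returns them as empty strings, as the id 61 row shows. If you would rather have None, one possible approach is sketched below. read_rows is a hypothetical helper name of mine, not part of the csv module, and the sample text is a few rows from the question held in memory so the snippet is self-contained.

```python
import csv
import io

# A few rows from the question, kept in a string so no file is needed.
sample = '''id,ticker1,ticker2,Company,Supplier
13,MMM,NSC,3M Company,Norfolk Southern Corporation
61,No,0,,
14,MMM,TER,3M Company,"Teradyne, Inc."
'''

def read_rows(text):
    """Hypothetical helper: parse CSV text and turn empty fields into None."""
    reader = csv.DictReader(io.StringIO(text))
    # `value or None` maps the empty string to None and leaves other values alone.
    return [{key: (value or None) for key, value in row.items()} for row in reader]

rows = read_rows(sample)
for row in rows:
    print(row["id"], row["Supplier"])
```

Whether None or "" is the better marker depends on what you do with the rows afterwards, so treat this as one option rather than the required way.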