I have a CSV file with State column, majority of the records have state as 2 characters (IL,CA,AZ) but there are few records that I need to clean up which spells the name of the state (Illinois, California, Arizona). Can anyone please suggest the code to clean up these states. For now, I can hard code only Illinois, but any advise on re-usability of the code to fix all the states in the USA to 2 char. Please advise.
This is the error I get while running a vsc file which has ~40k records:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4563: character maps to <undefined>
import geocoder
import unicodecsv
import logging
import time
from geopy.geocoders import Nominatim
pcode=[]
geolocator = Nominatim(user_agent="specify_your_app_name_here")
with open('Bookings.csv') as f:
reader = csv.DictReader(f)
for line in reader:
if line['state'] == "Illinois":
line['state'] = "IL"
f.to_csv("New_Bookings.csv", index=False)```
0x9d
in UTF-8 would be a continuation byte attached to the preceding character. If the preceding byte was below 0x7f
, then it would not be valid UTF-8. 0x9d
is not a valid character in a Windows style encoding like ISO-8859-1 where each character is a single byte. This leads me to believe that the file is in fact UTF-8.
unicodecsv
requires a bytestream to process: So add the b
flag for this as well as r
because you only want to read:
with open('Bookings.csv', 'rb') as f:
Also, specify the encoding on the reader as well:
csv.DictReader(f, encoding='utf-8')
The data you are changing is not staying in memory. It is only processing one line at a time. So in order to write the new file, you must open a csv writer
to write each line within the loop. So something like this:
with open('Bookings.csv', 'rb') as f:
reader = csv.DictReader(f, encoding='utf-8')
with open(file_name, 'wb') as f_out:
writer = csv.writer(f_out, encoding='utf-8')
for line in reader:
if line['state'] == "Illinois":
line['state'] = "IL"
writer.writerow(line)
Edit: So I was looking through some of the more common UTF-8 characters and found only one that contained 0x9d
(that's 0xe2 0x80 0x9d
to be exact). It is the special closing double quote. Open the file and search for this ”
to verify...