Update State column in a csv to display IL instead of Illinois in Python - Data Quality

I have a CSV file with State column, majority of the records have state as 2 characters (IL,CA,AZ) but there are few records that I need to clean up which spells the name of the state (Illinois, California, Arizona). Can anyone please suggest the code to clean up these states. For now, I can hard code only Illinois, but any advise on re-usability of the code to fix all the states in the USA to 2 char. Please advise.

This is the error I get while running a vsc file which has ~40k records:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4563: character maps to <undefined>

import geocoder
import unicodecsv
import logging
import time
from geopy.geocoders import Nominatim
pcode=[]
geolocator = Nominatim(user_agent="specify_your_app_name_here")
with open('Bookings.csv') as f:
     reader = csv.DictReader(f)
     for line in reader:
         if line['state'] == "Illinois":
            line['state'] = "IL"
f.to_csv("New_Bookings.csv", index=False)```

Solution

0x9d in UTF-8 would be a continuation byte attached to the preceding character. If the preceding byte was below 0x7f, then it would not be valid UTF-8. 0x9d is not a valid character in a Windows style encoding like ISO-8859-1 where each character is a single byte. This leads me to believe that the file is in fact UTF-8.

unicodecsv requires a bytestream to process: So add the b flag for this as well as r because you only want to read:

with open('Bookings.csv', 'rb') as f:

Also, specify the encoding on the reader as well:

csv.DictReader(f, encoding='utf-8')

The data you are changing is not staying in memory. It is only processing one line at a time. So in order to write the new file, you must open a csv writer to write each line within the loop. So something like this:

with open('Bookings.csv', 'rb') as f:
     reader = csv.DictReader(f, encoding='utf-8')
     with open(file_name, 'wb') as f_out:
         writer = csv.writer(f_out, encoding='utf-8')
         for line in reader:
             if line['state'] == "Illinois":
                line['state'] = "IL"
             writer.writerow(line)

Edit: So I was looking through some of the more common UTF-8 characters and found only one that contained 0x9d (that's 0xe2 0x80 0x9d to be exact). It is the special closing double quote. Open the file and search for this ” to verify...