Search code examples
pythoncsvexport-to-csv

Update State column in a csv to display IL instead of Illinois in Python - Data Quality


I have a CSV file with State column, majority of the records have state as 2 characters (IL,CA,AZ) but there are few records that I need to clean up which spells the name of the state (Illinois, California, Arizona). Can anyone please suggest the code to clean up these states. For now, I can hard code only Illinois, but any advise on re-usability of the code to fix all the states in the USA to 2 char. Please advise.

This is the error I get while running a vsc file which has ~40k records:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4563: character maps to <undefined>
import geocoder
import unicodecsv
import logging
import time
from geopy.geocoders import Nominatim
pcode=[]
geolocator = Nominatim(user_agent="specify_your_app_name_here")
with open('Bookings.csv') as f:
     reader = csv.DictReader(f)
     for line in reader:
         if line['state'] == "Illinois":
            line['state'] = "IL"
f.to_csv("New_Bookings.csv", index=False)```

Solution

  • 0x9d in UTF-8 would be a continuation byte attached to the preceding character. If the preceding byte was below 0x7f, then it would not be valid UTF-8. 0x9d is not a valid character in a Windows style encoding like ISO-8859-1 where each character is a single byte. This leads me to believe that the file is in fact UTF-8.

    unicodecsv requires a bytestream to process: So add the b flag for this as well as r because you only want to read:

    with open('Bookings.csv', 'rb') as f:
    

    Also, specify the encoding on the reader as well:

    csv.DictReader(f, encoding='utf-8')
    

    The data you are changing is not staying in memory. It is only processing one line at a time. So in order to write the new file, you must open a csv writer to write each line within the loop. So something like this:

    with open('Bookings.csv', 'rb') as f:
         reader = csv.DictReader(f, encoding='utf-8')
         with open(file_name, 'wb') as f_out:
             writer = csv.writer(f_out, encoding='utf-8')
             for line in reader:
                 if line['state'] == "Illinois":
                    line['state'] = "IL"
                 writer.writerow(line)
    
    

    Edit: So I was looking through some of the more common UTF-8 characters and found only one that contained 0x9d (that's 0xe2 0x80 0x9d to be exact). It is the special closing double quote. Open the file and search for this to verify...