Writing an Array of Dictionaries to CSV


I'm trying to write the dictionary (which the first part of the program generates) to a CSV file so that I can perform further operations on the data in Excel. I realize the code isn't efficient, but at this point I'd just like it to work; I can deal with speeding it up later.

import csv
import pprint

raw_data = csv.DictReader(open("/Users/David/Desktop/crimestats/crimeincidentdata.csv", "r"))

neighborhood = []
place_count = {}
stats = []

for row in raw_data:
    neighborhood.append(row["Neighborhood"])

for place in set(neighborhood):
    place_count.update({place:0})

for key,value in place_count.items():
    for place in neighborhood:
        if key == place:
            place_count[key] = place_count[key]+1

for key in place_count:
    stats.append([{"Location":str(key)},{"Volume":str(place_count[key])}])

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(stats)

The program is still running fine at this point, as is evident from the pprint output:

[   [{'Location': 'LINNTON'}, {'Volume': '109'}],
    [{'Location': 'SUNDERLAND'}, {'Volume': '118'}],
    [{'Location': 'KENTON'}, {'Volume': '715'}]  

This is where the error is definitely happening. The program writes the headers to the CSV just fine, then raises the ValueError.

fieldnames = ['Location', 'Volume']
with open('/Users/David/Desktop/crimestats/localdata.csv', 'w', newline='') as output_file:
    csvwriter = csv.DictWriter(output_file, delimiter=',', fieldnames=fieldnames, dialect='excel')
    csvwriter.writeheader()
for row in stats:
    csvwriter.writerow(row)
output_file.close()

I've spent quite a bit of time searching for this problem, but none of the suggestions I have tried have worked. I figure I must be missing something, so I'd really appreciate any and all help.

Traceback (most recent call last):
  File "/Users/David/Desktop/crimestats/statsreader.py", line 34, in <module>
    csvwriter.writerow(row)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/csv.py", line 153, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/csv.py", line 149, in _dict_to_list
+ ", ".join([repr(x) for x in wrong_fields]))
ValueError: dict contains fields not in fieldnames: {'Location': 'SABIN'}, {'Volume': '247'}

Solution

  • I believe your problem is here:

    for key in place_count:
        stats.append([{"Location":str(key)},{"Volume":str(place_count[key])}])
    

    This creates a list of two dictionaries for each row. The first has only a "Location" key, and the second has only a "Volume" key. However, csv.DictWriter expects a single dictionary per row, containing all of the keys. Change that code snippet to the following and it should work:

    for key in place_count:
        stats.append({"Location": str(key), "Volume": str(place_count[key])})
    

    That should take care of the errors you're seeing.
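
    As a minimal sketch of the corrected write step: the sample counts below are taken from the pprint output in the question, and io.StringIO stands in for the real output file (an assumption made so the example runs anywhere). With a real file, the writerow() loop should stay indented inside the with block so the file is still open while the rows are written.

```python
import csv
import io

# Sample counts standing in for the real place_count dict
# (values copied from the pprint output in the question).
place_count = {"LINNTON": 109, "SUNDERLAND": 118, "KENTON": 715}

# One dict per row, each containing every field name as a key.
stats = [
    {"Location": str(key), "Volume": str(count)}
    for key, count in place_count.items()
]

fieldnames = ["Location", "Volume"]

# io.StringIO replaces open(..., 'w', newline='') for illustration only.
output_file = io.StringIO(newline="")
csvwriter = csv.DictWriter(output_file, fieldnames=fieldnames, dialect="excel")
csvwriter.writeheader()
for row in stats:
    csvwriter.writerow(row)

csv_text = output_file.getvalue()
print(csv_text)
```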

    Now, as for why the error message complains about fields not in fieldnames, which completely misled you away from the real problem: the writerow() function expects to get a dictionary as its row parameter, but you're passing it a list. The result is confusion: it iterates over the row in a for loop, expecting to get a dict's keys (because that's what you get when you iterate over a dict in Python), and it compares those keys to the values in the fieldnames list. What it's expecting to see is:

    "Location"
    "Volume"
    

    in either order (the question uses Python 3.4, where a dict makes no guarantees about the order in which it returns its keys; insertion order is only guaranteed from Python 3.7 onward). The reason you pass in a fieldnames list is so that the fields can be written to the CSV in the correct order. However, because you're passing in a list of two dictionaries, when it iterates over the row parameter, it gets the following:

    {'Location': 'SABIN'}
    {'Volume': '247'}
    

    Now, the dictionary {'Location': 'SABIN'} does not equal the string "Location", and the dictionary {'Volume': '247'} does not equal the string "Volume", so the writerow() function thinks it has found dict keys that aren't in the fieldnames list you supplied, and it raises that exception. What was really happening was "you passed me a list of two dicts-of-one-key, when I expected a single dict-with-two-keys", but the function wasn't written to check for that particular mistake.
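
    To make that concrete, here is a small imitation of the membership check described above. The wrong_fields() helper is hypothetical and written only for illustration; it is not the csv module's code, and newer Python versions may perform the check differently and fail in another way when given a list.

```python
fieldnames = ["Location", "Volume"]


def wrong_fields(row):
    """Hypothetical helper: collect every item yielded by iterating
    over row that does not appear in fieldnames."""
    return [item for item in row if item not in fieldnames]


# What the question's code passed in: a list of two one-key dicts.
bad_row = [{"Location": "SABIN"}, {"Volume": "247"}]
# What DictWriter expects: a single dict with both keys.
good_row = {"Location": "SABIN", "Volume": "247"}

# Iterating over the list yields the two dicts, neither of which
# equals the string "Location" or "Volume".
bad_result = wrong_fields(bad_row)
# Iterating over the dict yields its keys, which are all in fieldnames.
good_result = wrong_fields(good_row)

print(bad_result)
print(good_result)
```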


    Now I'll mention a couple things you could do to speed up your code. One thing that will help quite a bit is to reduce those three for loops at the start of your code down to just one. What you're trying to do is to go through the raw data, and count the number of times each neighborhood shows up. First I'll show you a better way to do that, then I'll show you an even better way that improves on my first solution.

    The better way to do that is to make use of the wonderful defaultdict class that Python provides in the collections module. defaultdict is a subclass of Python's dictionary type, which will automatically create dict entries when they're accessed for the first time. Its constructor takes a single parameter, a function which will be called with no parameters and should return the desired default value for any new item. If you had used defaultdict for your place_count dict, this code:

    place_count = {}
    for place in set(neighborhood):
        place_count.update({place:0})
    

    could simply become:

    place_count = defaultdict(int)
    

    What's going on here? Well, the int function (which really isn't a function, it's the constructor for the int class, but that's a bit beyond the scope of this explanation) just happens to return 0 if it's called with no parameters. So instead of writing your own function def returnzero(): return 0, you can just use the existing int function (okay, constructor). Now the first time you access place_count["NEW PLACE"], the key NEW PLACE will automatically appear in your place_count dictionary, with the value 0.
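
    A quick sketch of that behaviour, using made-up keys purely for illustration:

```python
from collections import defaultdict

place_count = defaultdict(int)

# Calling int with no arguments gives the default value.
default_value = int()

# Merely reading a missing key creates it with that default.
first_lookup = place_count["NEW PLACE"]

# Incrementing a missing key therefore works without any setup.
place_count["KENTON"] += 1
place_count["KENTON"] += 1

print(dict(place_count))
```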

    Now, your counting loop needs to be modified too: it used to go over the keys of place_count, but now that place_count automatically creates its keys the first time they're accessed, you need a different source. But you still have that source in the raw data: the row["Neighborhood"] value for each row. So your for key,value in place_count.items(): loop could become:

    for row in raw_data:
        place = row["Neighborhood"]
        place_count[place] = place_count[place] + 1
    

    And now that you're using a defaultdict, you don't even need that first loop (the one that created the neighborhood list) at all! So we've just turned three loops into one. The final version of what I'm suggesting looks like this:

    from collections import defaultdict
    place_count = defaultdict(int)
    for row in raw_data:
        place = row["Neighborhood"]
        place_count[place] = place_count[place] + 1
        # Or: place_count[place] += 1
    

    However, there's a way to improve that even more. The Counter object from the collections module is designed for just this case, and has some handy extra functionality, like the ability to retrieve the N most common items. So the final final version :-) of what I'm suggesting is:

    from collections import Counter
    place_count = Counter()
    for row in raw_data:
        place = row["Neighborhood"]
        place_count[place] = place_count[place] + 1
        # Or: place_count[place] += 1
    

    That way if you need to retrieve the 5 most crime-ridden neighborhoods, you can just call place_count.most_common(5).
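
    As a further sketch, Counter can also be built directly from an iterable, which removes the explicit loop altogether. The rows list below is made-up sample data standing in for the csv.DictReader from the question.

```python
from collections import Counter

# Made-up rows standing in for the DictReader output.
rows = [
    {"Neighborhood": name}
    for name in ["KENTON", "SABIN", "KENTON", "LINNTON", "KENTON", "SABIN"]
]

# Counter counts each value produced by the generator expression.
place_count = Counter(row["Neighborhood"] for row in rows)

# The two most frequent neighborhoods, as (name, count) pairs.
top_two = place_count.most_common(2)

print(top_two)
```

    Because each row is consumed as it is read, this still makes only one pass over the data.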

    You can read more about Counter and defaultdict in the documentation for the collections module.