Search code examples
pythoncsvstreet-address

Output a CHANGING OrderedDict to CSV


I've poured over this post, but the answers don't seem to work for my needs. However, I'm very new to Python so that might be the problem as well.

Here's some lines from the output.csv:
Case Parties Address
25 THOMAS ST., PORTAGE, IN
67 CHESTNUT ST., MILLBROOK, NJ
1 EMPIRE DR., AUSTIN, TX, 11225
111 WASHINGTON AVE. #404, VALPARAISO, AK
89 E. JERICHO TPKE., Scarssdale, AZ

ORIGINAL POST CODE

import usaddress
import csv

with open('output.csv') as csvfile:
reader = csv.DictReader(csvfile)
    for row in reader:
        addr=row['Case Parties Address']
        data = usaddress.tag(addr)
        print(data)

(OrderedDict([('AddressNumber', u'4167'), ('StreetNamePreType', u'Highway'), ('StreetName', u'319'), ('StreetNamePostDirectional', u'E'), ('PlaceName', u'Conway'), ('StateName', u'SC'), ('ZipCode', u'29526-5446')]), 'Street Address'

Much like the previous post, I need to output the parsed data into csv. As best as I can tell, I need to do the following steps:

  1. Provide the headers as a list for reference. (They're listed here in 'Details'.)
  2. Using Usaadress.tag(), parse the source_csv as "data" BUT keep their corresponding "keys."
  3. Map the key:data to the header_reference
  4. Export into output_csv that has one header row.

I'm using the Python module, usaaddress, to parse a large csv (200k+). The module outputs the parsed data using OrderedDict. The aforementioned post only works if all the fields map to the same headers for all records. However, one of the MANY benefits of usaddress is that it parses out data even when there isn't fields to parse. So, for example, "123 Fake St, Maine, PA" maps to address,city,state headers perfectly. But "123 Jumping Block, Suite 600, Maine, PA" will put the "Suite 600" in the "city" column since it's matching statically based on position. If I parse the latter by itself, usaddress provides address, occupancy identifier (e.g. "suite #"), city, state headers.

The output format that I need is provided when I use the parserator's online parser, but it only can accommodate 500 rows at a time.

It seems like my code won't know what each data point is until it gets routed through the module; a chicken-or-the-egg situation. How do I write rows to a CSV file when each row might have a different subset of columns?

For reference, the error that I get when I try the closest solution (provided by isosceleswheel) is valueerror: I/O(...) and they reference lines 107 and 90 of the csv.py library, both of which pertain to fieldnames.

with open('output.csv') as csvfile:
reader = csv.DictReader(csvfile)

with open('myoutputfile', 'w') as o:  # this will be the new file you write to
    for row in reader:
        addr=row['Case Parties Address']
        data = usaddress.tag(addr)
        header = ','.join(data.keys()) + '\n'  # this will make a string of the header separated by comma with a newline at the end
        data_string = ','.join(data.values()) + '\n' # this will make a string of the values separated by comma with a newline at the end
        o.write(header + data_string)  # this will write the header and then the data on a new line with each field separated by commas

Solution

  • see this github issue for a solution

    since we know all the possible labels in usaddress, we can use them to define the fields in the output.

    I can't comment on an answer b/c I don't have enough reputation, but I'd recommend against using the usaddress parse method for this task. The tag method will parse an address and then concatenate consecutive address tokens when they have the same label, and will raise an error if there are non-consecutive tokens with the same label - it'd be good to capture tagging errors in the output.