Search code examples
pythonregexcsvfindallpostal-code

Output a re.findall() to CSV in addition to all input data


I am trying to save the results of a regex .findall() search into a csv but am having difficulties appending the result to an output file.

As I am still very new to Python I am attempting to limit this problem to using only the csv and re libraries - but if there is a far easier way (i.e. in pandas) that would also be helpful to know.


  1. How would one copy the entire contents of the input CSV to the output CSV and add the postcode / found regular expression to the line it was found on?

  2. Are there any obvious forms of error checking or other things that I am missing?

  3. Does a better method exist to add the header of the input CSV automatically to the output CSV without explicitly specifying them?

  4. Is it possible to do this using DictWriter? As I had originally attempted.


import csv, re

pattern = r'[A-Z]{1,2}[0-9R][0-9A-Z]?[0-9][A-Z]{2}'
postcodes = []
with open(r'Postcode/addressin.csv', 'r') as csvinput:
    csv_reader = csv.DictReader(csvinput)

    with open(r'Postcode/addressout.csv', 'w', newline='') as csvoutput:
        fieldnames = ['Address', 'Name']
        csv_writer = csv.writer(csvoutput)

        csv_writer.writerow(fieldnames)

        for line in csv_reader:
            postcodes = re.findall(pattern, line["Address"])
            csv_writer.writerow(postcodes)

Example Data:

Address,Name,Lat,Long,2016 Sales,Type
48  Park Avenue, LATTON, SN6 4SZ,Nikki Yellowbeard,-23.17549,36.74641,9727,AA
IV21 1TD 116  Walwyn Rd CHARLESTOWN,Jonh Doe,-10.98309,156.41854,11932,AE

Solution

  • It seems to me that there are commas in the first field addresses that would create irregularities, and I'm not so sure what might be the best method to bypass those, yet this expression:

    (.*),(.*),\s*([0-9.-]+)\s*,\s*([0-9.]+)\s*,([0-9]{4,5}(?:-[0-9]{4})?)\s*,\s*([A-Z]{2})
    

    might be an approach to look into.


    Demo


    US Zip codes are usually in these formats:

    ([0-9]{5}(?:-[0-9]{4})?)
    

    just for demoing, I have included:

    [0-9]{4,5}
    

    which you can simply remove that.

    Example

    import re
    
    regex = r"(.*),(.*),\s*([0-9.-]+)\s*,\s*([0-9.]+)\s*,([0-9]{4,5}(?:-[0-9]{4})?)\s*,\s*([A-Z]{2})"
    
    test_str = ("Address,Name,Lat,Long,2016 Sales,Type\n"
        "48  Park Avenue, LATTON, SN6 4SZ,Nikki Yellowbeard,-23.17549,36.74641,9727,AA\n"
        "IV21 1TD 116  Walwyn Rd CHARLESTOWN,Jonh Doe,-10.98309,156.41854,11932,AE")
    
    matches = re.finditer(regex, test_str, re.MULTILINE)
    
    for matchNum, match in enumerate(matches, start=1):
        
        print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
        
        for groupNum in range(0, len(match.groups())):
            groupNum = groupNum + 1
            
            print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
    

    If we would not be validating values, then simply this expression

    (.*),(.*),(.*),(.*),(.*),(.*)
    

    might likely work.

    Demo