Eliminate redundancies from a file using Python


How can I condense (i.e. eliminate redundancies from) the following data?

code: GB-ENG, jobs: 2673
code: GB-ENG, jobs: 23
code: GB-ENG, jobs: 459
code: GB-ENG, jobs: 346
code: RO-B, jobs: 9
code: DE-NW, jobs: 4
code: DE-BW, jobs: 3
code: DE-BY, jobs: 9
code: DE-HH, jobs: 34
code: DE-BY, jobs: 11
code: BE-BRU, jobs: 27
code: GB-ENG, jobs: 20

The output should look like this:

GB-ENG, 3521
RO-B, 9
DE-NW, 4
DE-BW, 3
DE-HH, 34
DE-BY, 20
BE-BRU, 27

Each code, e.g. DE-BY, should appear exactly once, paired with the sum of the job counts from every line where that code occurs, e.g.:

code: DE-BY, jobs: 11
code: DE-BY, jobs: 9

becomes

DE-BY, 20

At the moment I'm creating that input with this Python script:

import json
import requests
from collections import defaultdict
from pprint import pprint

def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)

# open up the output of 'data-processing.py'
with open('job-numbers-by-location.txt') as data_file:

    # print the output to a file
    with open('phase_ii_output.txt', 'w') as output_file_:
        for line in data_file:
            identifier, name, coords, number_of_jobs = line.split("|")
            coords = coords[1:-1]
            lat, lng = coords.split(",")
            # print("lat: " + lat, "lng: " + lng)
            response = requests.get("http://api.geonames.org/countrySubdivisionJSON?lat="+lat+"&lng="+lng+"&username=s.matthew.english").json()


            codes = response.get('codes', [])
            for code in codes:
                if code.get('type') == 'ISO3166-2':
                    country_code = '{}-{}'.format(response.get('countryCode', 'UNKNOWN'), code.get('code', 'UNKNOWN'))
                    if not hasNumbers( country_code ):
                        # print("code: " + country_code + ", jobs: " + number_of_jobs)
                        output_file_.write("code: " + country_code + ", jobs: " + number_of_jobs)

It would probably be most efficient to include this functionality in that script, but I haven't yet figured out how.
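One way to fold the aggregation into the script itself, as a minimal sketch: accumulate a running total per code instead of writing a line for every lookup, then write one line per code once the loop has finished. The helper names `aggregate` and `format_totals` are hypothetical, and the `sample` list only stands in for the (country_code, number_of_jobs) pairs that the script builds from the geonames responses.

```python
from collections import defaultdict


def aggregate(records):
    """Sum job counts per code; `records` yields (code, jobs) pairs."""
    totals = defaultdict(int)
    for code, jobs in records:
        # jobs may be a string with a trailing newline, as in the script;
        # int() tolerates surrounding whitespace
        totals[code] += int(jobs)
    return dict(totals)


def format_totals(totals):
    """Render totals as 'CODE, N' lines, in first-seen order."""
    return "".join("{}, {}\n".format(code, n) for code, n in totals.items())


# Stand-in for the (country_code, number_of_jobs) pairs built inside the loop.
sample = [("DE-BY", "9\n"), ("DE-HH", "34\n"), ("DE-BY", "11\n")]
print(format_totals(aggregate(sample)), end="")
```

In the script above, this would amount to replacing the `output_file_.write(...)` call with `totals[country_code] += int(number_of_jobs)` and writing `format_totals(totals)` to the output file after the loop over `data_file` ends.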


Solution

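  • Assuming the text is stored in a text file, this would work (Python 3):

    from collections import defaultdict

    totals = defaultdict(int)
    with open('redundancy.txt') as infile:
        for line in infile:
            parts = line.split()  # ['code:', 'GB-ENG,', 'jobs:', '2673']
            if len(parts) < 4:
                continue  # skip blank or malformed lines
            code = parts[1].rstrip(',')  # drop the trailing comma from the key
            totals[code] += int(parts[3])  # int() instead of eval(); sums every occurrence
    print(dict(totals))

    With the sample input from the question, it gives this result:

    {'GB-ENG': 3521, 'RO-B': 9, 'DE-NW': 4, 'DE-BW': 3, 'DE-BY': 20, 'DE-HH': 34, 'BE-BRU': 27}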
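To get the "CODE, total" lines the question asks for rather than a dictionary, the lines can also be parsed with a regular expression and summed in a `Counter`. This is only a sketch: the function name `condense` and the pattern `LINE_RE` are hypothetical, and the pattern assumes every line has the shape `code: X, jobs: N` shown in the question.

```python
import re
from collections import Counter

# Assumed line shape: "code: GB-ENG, jobs: 2673"
LINE_RE = re.compile(r"code:\s*(?P<code>[^,\s]+),\s*jobs:\s*(?P<jobs>\d+)")


def condense(lines):
    """Return a Counter mapping each code to its summed job count."""
    totals = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # ignore blank or malformed lines
            totals[match.group("code")] += int(match.group("jobs"))
    return totals


lines = [
    "code: GB-ENG, jobs: 2673",
    "code: DE-BY, jobs: 9",
    "code: DE-BY, jobs: 11",
    "code: GB-ENG, jobs: 20",
]
for code, total in condense(lines).items():
    print("{}, {}".format(code, total))
```

Passing an open file object instead of the `lines` list works the same way, since the function only iterates over its argument.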