python parsing sorting text reduction

Python data manipulation


I have two files as input (each has more columns, but I narrowed them down to the important ones):

A   15.6            A   D
B   10.3            A   B
C   12.5            A   E
D   14.5            A   Y
E   11.4            C   A
F   23.7            C   B
                    C   R
                    D   A
                    D   R
                    D   F

The first file is a kind of index. I want to go through the pairs in the second file, look up the value of each key in the first file, and print the key with the smaller value (if one of the keys isn't in the index file, print the other one by default). After that I'd like to remove all repeating entries, i.e.

D   14.5
B   10.3
E   11.4                A   15.6
A   15.6                B   10.3
C   12.5    ------->    C   12.5
B   10.3                D   14.5
C   12.5                E   11.4
D   14.5
D   14.5
D   14.5

So, it's essentially an index file reduction. There has to be an elegant way in Python for doing it...
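
The per-pair rule described above can be sketched as a small function, assuming the index is already loaded into a dict. The name `pick_smaller` is hypothetical, and treating a missing key as infinitely large is one possible way to get the "print the other one by default" behaviour.

```python
def pick_smaller(key1, key2, index):
    """Return whichever key has the smaller value in the index.

    A key that is missing from the index is treated as infinitely large,
    so the other key wins by default. If neither key is in the index,
    key1 is returned (min() keeps the first of equal candidates).
    """
    return min((key1, key2), key=lambda k: index.get(k, float('inf')))


# Sample index taken from the first file above.
index = {'A': 15.6, 'B': 10.3, 'C': 12.5, 'D': 14.5, 'E': 11.4, 'F': 23.7}

print(pick_smaller('A', 'D', index))  # D, because 14.5 < 15.6
print(pick_smaller('A', 'Y', index))  # A, because Y is not in the index
```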


Solution

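    # Placeholder paths; point these at the index file and the pairs file.
    filename1 = 'index.txt'
    filename2 = 'pairs.txt'

    mapping = dict()   # key -> value, loaded from the index file
    result = set()     # keys that won at least one comparison (no duplicates)

    with open(filename1, 'r') as f1, open(filename2, 'r') as f2:
        for line in f1:
            line = line.split()
            if line:   # skip blank lines
                key, val = line
                mapping[key] = float(val)  #1

        for line in f2:
            line = line.split()
            if line:   # skip blank lines
                key1, key2 = line
                if key1 in mapping:   #4
                    # A key missing from the index counts as infinitely large,
                    # so the other key is chosen by default.
                    result.add(min(line, key=lambda x: mapping.get(x, float('inf'))))  #2

    for key in result:
        print('{k} {v}'.format(k=key, v=mapping[key]))   #3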
    
    1. Load the data from the first file into a dict (called mapping).
    2. Collect all the keys associated with minimal values in a set (called result).
    3. Report the keys. Note that since result is a set, there is no predefined order in which the keys will be reported.
    4. Per the extra requirement in the comments, ignore rows where key1 is not in the first file.
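
    If the output should come out in key order, as in the question's expected result, one option is to sort the set before reporting. The following is a minimal sketch under that assumption; the function names `load_index` and `reduce_pairs` are hypothetical, and the data is inlined as lists of lines instead of being read from files so that the example runs on its own.

```python
def load_index(lines):
    """Parse 'key value' lines into a dict, skipping blank lines."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if fields:
            key, val = fields
            mapping[key] = float(val)
    return mapping


def reduce_pairs(lines, mapping):
    """Return the sorted, de-duplicated keys with the smaller value per pair.

    Rows whose first key is not in the index are ignored, matching step 4.
    """
    result = set()
    for line in lines:
        fields = line.split()
        if fields:
            key1, key2 = fields
            if key1 in mapping:
                result.add(min(fields, key=lambda k: mapping.get(k, float('inf'))))
    return sorted(result)


# Sample data copied from the question.
index_lines = ["A 15.6", "B 10.3", "C 12.5", "D 14.5", "E 11.4", "F 23.7"]
pair_lines = ["A D", "A B", "A E", "A Y", "C A", "C B", "C R", "D A", "D R", "D F"]

mapping = load_index(index_lines)
for key in reduce_pairs(pair_lines, mapping):
    print('{k} {v}'.format(k=key, v=mapping[key]))
```

    With real files, the same functions can be passed open file objects, since iterating over a file yields its lines.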