Search code examples
python-3.xdata-files

Python3 how to combine 2 text files line by line conditionally


I have two ascii tables in text files containing information about stars, one of which with the headers

| ID | TIME | MAGNITUDE | ERROR |

and the other has the headers

| ID | CLASS |

and I want to add the CLASS column to the first text file. The main problem here is that the first text file has got many rows for each star (I.E. Star 3_6588 has got 20 entries in table a for different times) whereas the second text file has only one entry for each ID (as Star 3_6588 is always a Class I).

What I need to do is to add the |CLASS| column to the first table where every instance of a specific ID has the required class. The text file has over 14 million rows in it which is why I can't just do this manually.


Solution

  • @Terry Spotts has the right idea. However the leading and trailing | characters in the header line will make this a slightly tricky CSV, as the delimiter is a pipe char, but sometimes with a leading space, trailing space, or both. But here's an example to generate your ID: Class dictionary:

    > cat bigfile.txt
    | ID | TIME | MAGNITUDE | ERROR |
    | Star 3_6588 | 10 | 2 | 1.02 |
    | Star 3_6588 | 15 | 4 | 1.2 |
    | Star 2_999 | 20 | 6 | 1.4 |
    | Star 2_999 | 25 | 8 | 1.6 |
    
    > cat smallfile.txt
    | ID | CLASS |
    | Star 3_6588 | CLASS I |
    

    Code:

    id2class = {}
    with open('/tmp/smallfile.txt', 'r') as classfile:
        line = classfile.readline()        # Skip Header Line
        for line in classfile:
            line = line.rstrip('\n')[2:-2] # strip newline and the Pipe-Space / Space-Pipe and the start + end
            fields = line.split(' | ')     # Split on ' | '
            id = fields[0]
            starclass = fields[1]
            id2class[id] = starclass
    

    Now you have a dict id2class that looks like:

    {
        'Star 3_6588': 'CLASS I',
        'Star 2_999': 'CLASS II'
    }
    

    You can then parse the first file in a similar way, use the ID of each line to look up the Class in the dict, and write out the full data for the line to a new file. I'll leave that part to you :)

    Happy Coding!