Search code examples
pythoncsvformattingformattabular

How to Alter the Structure of a File into a Tabular Format?


I have a file contains the following data:

Input:

Query= A1 bird
Hit= B1 owl
Score= 1.0 4.0 2.5
Hit= B2 bluejay
Score= 10.0 6.0 7.0
Query= A2 shark
Hit= C1 catshark
Score= 10.0 7.0 2.0
Query= A3 cat
Hit= D1 dog
Score= 7.0 2.0 1.0

I want to write a program that manipulates the structure of data to make it in a tabular (.csv) format ... something like the following:

Output:

Query = A1 bird, Hit= B1 owl, Score= 1.0 4.0 2.5 #The first query, hit, score 
Query = A1 bird, Hit= B2 bluejay, Score= 10.0 6.0 7.0 #The second hit and score associated with the first query
Query = A2 shark, Hit= C1 catshark, Score= 10.0 7.0 2.0 #The second query, hit, socre
Query = A3 cat, Hit= D1 dog, Score= 7.0 2.0 1.0 #The third query, hit, score

I tried to doing the following suggested solution by Takis:

with open('g.txt', 'r') as f, open('result.csv', 'w') as csvfile:
fieldnames = ['Query', 'Hit', 'Score']
csvwriter = csv.DictWriter(csvfile, quoting=csv.QUOTE_ALL, 
                           fieldnames=fieldnames)
csvwriter.writeheader()
data = {}
for line in f:
    key, value = line.split('=')
    data[key.strip()] = value.strip()
    if len(data.keys()) == 3:
        csvwriter.writerow(data)
        data = {}

Question: How to make the program recognizes the Hits and Scores associated with each Query, so I can print them in one line? And if a Query has multiple Hits and Scores under it (associated with) print the query, the second Hit, and the second Score. Exactly like the following output:

"A1 bird","B1 owl","1.0 4.0 2.5" #1st Query, its 1st Hit, its 1st Score
"A1 bird","B2 bluejay", "10.0 6.0 7.0" #1st Query, its 2nd Hit, its 2nd Score
"A2 shark","C1 catshark", "10.0 7.0 2.0"#2nd Query, 1st and only Hit, 1st and only Score
"A3 cat","D1 dog","7.0 2.0 1.0"#3d Query, 1st and only Hit, 1st and only Score  

Any idea?


Solution

  • I would use the DictWriter class in the csv package to write the parsed data to CSV. There's no error handling, the program assumes the three needed data items will appear for each query, although they do not need to given in the same order for each query.

    import csv
    
    with open('g.txt', 'r') as f, open('result.csv', 'w') as csvfile:
        fieldnames = ['Query', 'Hit', 'Score']
        csvwriter = csv.DictWriter(csvfile, quoting=csv.QUOTE_ALL, 
                                   fieldnames=fieldnames)
        csvwriter.writeheader()
        data = {}
        for line in f:
            key, value = line.split('=')
            data[key.strip()] = value.strip()
            if len(data.keys()) == 3:
                csvwriter.writerow(data)
                data = {}