Search code examples
pythoncsvpython-2.7python-itertools

Selecting data from groups within a csv and appending data to text file


I have a problem which I do not know how to solve currently. I have a csv with the format as shown below. Now what i need to do is perform some match scenarios and append some text strings to a file.

x,classA,uniqueclassindicator1,1,125,21.8,1,5.22,
x,classc,uniqueclassindicator1,3,125,21.8,2,5.22,
x,classd,uniqueclassindicator2,1,125,21.8,,,
x,classe,uniqueclassindicator2,2,125,21.8,,,
x,classBa,uniqueclassindicator2,3,125,21.8,,,
x,classBc,uniqueclassindicator2,4,125,21.8,,,
x,classAd,uniqueclassindicator3,1,125,21.8,2,2.56,
x,classc,uniqueclassindicator3,2,125,21.8,1,2.56,
x,classD,uniqueclassindicator3,3,125,21.8,,,
x,classa,uniqueclassindicator3,4,125,21.8,,,
x,classn,uniqueclassindicator4,1,125,21.8,,,
x,classm,uniqueclassindicator4,2,125,21.8,,,
x,classt,uniqueclassindicator4,3,125,21.8,,,
x,classd,uniqueclassindicator4,4,125,30.8,,,
x,classa,uniqueclassindicator4,5,125,31.8,,,
x,classn,uniqueclassindicator4,6,125,30.8,,,
x,classq,uniqueclassindicator5,1,125,35.8,1,3.31,3.1
x,classqe,uniqueclassindicator5,2,125,21.8,2,3.31,3.1 
x,classS,uniqueclassindicator5,3,125,21.8,3.31,3.1
x,classK,uniqueclassindicator5,4,125,21.8,,,
x,classL,uniqueclassindicator5,5,125,21.8,,,
x,classG,uniqueclassindicator5,6,125,21.8,,,
x,classH,uniqueclassindicator6,1,125,35.8,1,2.89,2.25   
x,classF,uniqueclassindicator6,2,125,21.8,3,2.89,2.25
x,classP,uniqueclassindicator6,3,125,21.8,2,2.89,2.25
x,classY,uniqueclassindicator6,4,125,21.8,,,
x,classU,uniqueclassindicator6,5,125,21.8,,,
x,classR,uniqueclassindicator6,6,125,21.8,,,

Throughout this example assume zero based indexing

You will notice that in the csv there column 2 is the uniqueclassindicator and I need to perform the following for each class.

1.

If column 3 & column 6 are 1, and for the same uniqueclass (different row) in column 3 & 6 are both 2 then produce the string:

   "text data text" (column [1]) #where row = 1# "text data" column [1] #where row =2# "text" (column[17])`

for example, in row 15 we have this exact scenario. So the string text string would need to read: text data text classq text data classqe text 3.31

In the above text string, "classq" was pulled from column 1 row column 15, "classqe" was pulled from column1 row 16 and "3.31" was pulled from column8 row 15.

Just to reiterate, the match that occured to produce this string is for uniqueclassindicator5 within this class both column 3 & column 6 matched (1-1 & 2-2)

2.

Is the pretty much the same as 1, however it is when column 3 and column 6 are 1,2 & 2,1. This occurs in uniqueclassindicator3 see row 7 as an example. So we would need to append the string:

text data text classc text data classAd text 2.56 #Note I have listed the class which had a 1 in column 16 first.`

3.

This scenario is for when the 1,2,3 for column 3 for a given class matches the 1,2,3 for column 6, fortunately here we only need to return the 8 value in a string such as:

test data test data (column[8]) test data test

4

like scenario 2 is when the same thing happens but not in the correct order. So if column 3 for a given uniqueclassinidcator =1-3 and column 6 = 1-3 (with the exception of scenario 3 just descibed.) Then create the following string.

data data (column[8]) data data.

I know the code required to do this is not the most simple, but if anyone could help me achieve this I would be very much in their debt. If anything is not clear please do not hesitate to contact me. Many thanks SMNALLY

Edit - Upon running code supplied by Martijn Pieters

I attempted to run the following code to match objectives 1,2 & 3. Whilst I can get objectives 1 & 2 working easy enough. I could not get Objective 3 to work.

from collections import defaultdict
import csv

# you probably can think up better names
fields = ('x', 'class', 'indicator', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8')

entries = defaultdict(list)

with open('test.csv', 'rb') as fd:
    reader = csv.DictReader(fd, fields)

    for row in reader:
        # each row is now a dictionary
        # make your numbers, numbers
        for field in fields[3:]:
            row[field] = row[field] and float(row[field])

        previous = entries[row['indicator']]
        for p in previous:

            ##Objective 1
            if (row['col3'], row['col6']) == (2, 2) and (p['col3'], p['col6']) == (1, 1):
                print 'text {p[class]} text {r[class]} text {r[col7]}'.format(p=p, r=row)
            # etc, testing againts previous rows with the same indicator
            ##Objective 2
            if (row['col3'], row['col6']) == (2, 1) and (p['col3'], p['col6']) == (1, 2):
                print 'data {p[class]} & {r[class]} data {r[col7]}'.format(p=p, r=row)
            ##Objective 3
            if (row['col3'], row['col6']) == (3, 3) and (p['col3'], p['col6']) == (2, 2) and (p['col3'], p['col6']) == (1, 1):
                print 'text data text data {r[col8]}'.format(p=p, r=row)     

        # remember this row for later rows to match against.
        previous.append(row)

Can you anyone tell me what I have done wrong for objective 3? I recieved no traceback but also recieved no text strings.


Solution

  • You can store information keyed by column 2 in a dictionary for easy lookup; for each unique column value keep a list of entries to match against later on.

    A collections.defaultdict() object makes the first part easy. I'd use csv.DictReader() to give each column a meaningful name; instead of mentally having to map each column number to a meaning, the columns then have names, much easier to track:

    from collections import defaultdict
    import csv
    
    # you probably can think up better names
    fields = ('x', 'class', 'indicator', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8')
    
    entries = defaultdict(list)
    
    with open(filename, 'rb') as fd:
        reader = csv.DictReader(fd, fields)
    
        for row in reader:
            # each row is now a dictionary
            # make your numbers, numbers
            for field in fields[3:]:
                row[field] = row[field] and float(row[field])
    
            previous = entries[row['indicator']]
            for p in previous:
                if (row['col3'], row['col6']) == (2, 2) and (p['col3'], p['col6']) == (1, 1):
                    print 'text data text {p[class]} text data {r[class]} text {r[col8]}'.format(p=p, r=row)
                # etc, testing againts previous rows with the same indicator
    
            # remember this row for later rows to match against.
            previous.append(row)
    

    This just matches your first scenario, but the other scenarios are just as easy to match.

    This should be efficient enough if the number of entries per unique class indicator is low. If you run into hundreds (or worse) of rows per indicator, you need to start looking into efficient matching structures per scenario (as they match in different ways), to speed up lookups. This will probably require more memory, trading memory for increased speed.

    Testing the above against your input dataset prints:

    text data text classq text data classqe text 3.1
    

    Adjusting the code to support unique (col3, col6) tuples:

    from collections import defaultdict
    import csv
    
    # you probably can think up better names
    fields = ('x', 'class', 'indicator', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8')
    
    entries = defaultdict(dict)
    
    with open(filename, 'rb') as fd:
        reader = csv.DictReader(fd, fields)
    
        for row in reader:
            # each row is now a dictionary
            # make your numbers, numbers
            for field in fields[3:]:
                row[field] = row[field] and float(row[field])
    
            key = (row['col3'], row['col6'])
            previous = entries[row['indicator']]
    
            # scenario 1
            if key == (2, 2) and (1, 1) in previous:
                p = previous[(1, 1)]
                print 'text data text {p[class]} text data {r[class]} text {r[col8]}'.format(p=p, r=row)
    
            # scenario 3
            if key = (3, 3) and (1, 1) in previous and (2, 2) in previous:
                print 'text data text data {r[col8]}'.format(r=row)
    
            # remember this row for later rows to match against.
            previous[key] = row