Search code examples
pythoncsvsetunique

Python to extract unique CSV rows


I'm trying to get the first occurrences of each row of a CSV in Python. However, I'm facing an issue. My CSV files looks like this:

1,2,3,a,7,5,y,0
1,2,3,a,3,5,y,8
1,2,3,a,5,3,y,7
1,2,3,d,7,5,n,0
1,2,3,d,3,5,n,8
1,2,3,d,5,3,n,7
2,3,4,f,4,6,y,9
2,3,4,f,5,6,y,9
2,3,4,f,7,3,y,9
2,3,4,e,3,5,n,9
2,3,4,e,0,7,n,9
2,3,4,e,5,8,n,9

I tried this way to get the first occurrences of unique values based on one of the columns.

def unique():
    rows = list(csv.reader(open('try.csv', 'r'), delimiter=','))
    columns = zip(*rows)
    uniq = set(columns[1])

    indexed = defaultdict(list)

    for x in uniq:
        i = columns[1].index(x)
        indexed[i] = rows[i]

    return indexed

It works fine for one unique column value set. However,

  1. I'd like to set columns[1] and columns[6] as unique values.
  2. The tricky part is columns[6] is always y or n. If I set that, it returns me only first y and n columns. I'd like to get all the columns that have both columns[1] and columns[6] as unique. For every columns[2] value, I need the first occurrence of y and n rows. Sorry for my poor description. So basically, I'd like my output to be like:
1,2,3,d,7,5,n,0,a
2,3,4,e,3,5,n,9,f

Solution

  • There are some room for improvement in your code, but I didn't want to rewrite it in depth, as you had it almost right. The "key" point is that you need a compound key. This is the pair (r[1],r[6]) that has to be unique. In addition, I took the liberty to use an OrderedDict for fast-lookup, but preserving the row order.

    import csv
    import collections
    
    def unique():
        rows = list(csv.reader(open('try.csv', 'r'), delimiter=','))
        result = collections.OrderedDict()
        for r in rows:
            key = (r[1],r[6])  ## The pair (r[1],r[6]) must be unique
            if key not in result:
                result[key] = r
    
        return result.values()
    
    from pprint import pprint
    pprint(unique())
    

    Producing:

    [['1', '2', '3', 'a', '7', '5', 'y', '0'],
     ['1', '2', '3', 'a', '7', '5', 'n', '0'],
     ['2', '3', '4', 'f', '4', '6', 'y', '9'],
     ['2', '3', '4', 'f', '3', '5', 'n', '9']]