
Find duplicates in a column, return each unique item and list its corresponding values from another column in Python


I would like to remove duplicates from column 1 and return in column 2 the list of values associated with each unique item, using Python.

The input is

1 2
Jack London 'Son of the Wolf'
Jack London 'Chris Farrington'
Jack London 'The God of His Fathers'
Jack London 'Children of the Frost'
William Shakespeare 'Venus and Adonis'
William Shakespeare 'The Rape of Lucrece'
Oscar Wilde 'Ravenna'
Oscar Wilde 'Poems'

while the output should be

1 2
Jack London 'Son of the Wolf, Chris Farrington, The God of His Fathers, Children of the Frost'
William Shakespeare 'The Rape of Lucrece, Venus and Adonis'
Oscar Wilde 'Ravenna, Poems'

where the second column holds the combined values associated with each item. I tried the set() function on a dictionary:

dic={'Jack London': 'Son of the Wolf', 'Jack London': 'Chris Farrington', 'Jack London': 'The God of His Fathers'}
set(dic)

but it returned only one key:

set(['Jack London'])
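
The behaviour above can be reproduced with a minimal sketch. It assumes the cause is the dictionary literal itself rather than set(): a Python dict keeps one entry per key, so repeating 'Jack London' overwrites the earlier values before set() ever sees them. The `pairs` list is a hypothetical alternative layout, not something from the question.

```python
# A dict literal with a repeated key keeps a single entry; the last value wins.
dic = {'Jack London': 'Son of the Wolf',
       'Jack London': 'Chris Farrington',
       'Jack London': 'The God of His Fathers'}

print(len(dic))   # the three "entries" collapsed into one
print(dic)        # only the last title survives
print(set(dic))   # set() of a dict is the set of its keys

# Assumption: storing the rows as a list of (author, title) tuples instead
# keeps every pair, duplicates included.
pairs = [('Jack London', 'Son of the Wolf'),
         ('Jack London', 'Chris Farrington'),
         ('Jack London', 'The God of His Fathers')]
print(len(pairs))
```

So the titles are already lost when the dictionary is built, which is why a list of rows is the more useful starting point.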

Solution

  • You can use itertools.groupby here. It only groups consecutive rows that share a key, which works because your list is already sorted by the first column.

    rows = [('1', '2'),
            ('Jack London', 'Son of the Wolf'),
            ('Jack London', 'Chris Farrington'),
            ('Jack London', 'The God of His Fathers'),
            ('Jack London', 'Children of the Frost'),
            ('William Shakespeare', 'Venus and Adonis'),
            ('William Shakespeare', 'The Rape of Lucrece'),
            ('Oscar Wilde', 'Ravenna'),
            ('Oscar Wilde', 'Poems')]
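    # I'm not sure how you load the data, but assume it ends up as a list
    # of (column 1, column 2) tuples like the one above.

    from itertools import groupby
    from operator import itemgetter

    # groupby yields (key, group) pairs for each run of consecutive rows
    # whose first element is equal
    grouped = groupby(rows, itemgetter(0))
    # join the second element of every row in a group into one string
    result = {group: ', '.join(value[1] for value in values) for group, values in grouped}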
    

    This gives you the following result (the header row ('1', '2') is grouped like any other row):

    In [1]: pprint(result)
    {'1': '2',
     'Jack London': 'Son of the Wolf, Chris Farrington, The God of His Fathers, '
                    'Children of the Frost',
     'Oscar Wilde': 'Ravenna, Poems',
     'William Shakespeare': 'Venus and Adonis, The Rape of Lucrece'}
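
If the rows are not already sorted by the first column, groupby would split one author into several groups. A minimal sketch of one alternative, assuming the same (author, title) tuples but in a hypothetical shuffled order, collects the titles with collections.defaultdict instead. The variable names here are illustrative and not from the answer above.

```python
from collections import defaultdict

# Hypothetical reordering of a few rows from the question, no longer sorted
# by author.
rows = [('Oscar Wilde', 'Ravenna'),
        ('Jack London', 'Son of the Wolf'),
        ('William Shakespeare', 'Venus and Adonis'),
        ('Jack London', 'Chris Farrington'),
        ('Oscar Wilde', 'Poems')]

# Append each title to the list kept for its author, whatever the row order.
titles_by_author = defaultdict(list)
for author, title in rows:
    titles_by_author[author].append(title)

# Join each author's titles into a single comma-separated string.
result = {author: ', '.join(titles) for author, titles in titles_by_author.items()}
print(result)
```

Sorting the rows by the first column before calling groupby would also work. The dictionary approach avoids the sort and keeps the titles in their original order.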