I would like to remove duplicates from column 1 and return in column 2 the list of values associated with each unique item, using Python.
The input is
1 2
Jack London 'Son of the Wolf'
Jack London 'Chris Farrington'
Jack London 'The God of His Fathers'
Jack London 'Children of the Frost'
William Shakespeare 'Venus and Adonis'
William Shakespeare 'The Rape of Lucrece'
Oscar Wilde 'Ravenna'
Oscar Wilde 'Poems'
while the output should be
1 2
Jack London 'Son of the Wolf, Chris Farrington, Able Seaman, The God of His Fathers,Children of the Frost'
William Shakespeare 'The Rape of Lucrece,Venus and Adonis'
Oscar Wilde 'Ravenna,Poems'
where the second column holds the combined values associated with each item. I tried the set() function on a dictionary
dic={'Jack London': 'Son of the Wolf', 'Jack London': 'Chris Farrington', 'Jack London': 'The God of His Fathers'}
set(dic)
but it returned only a single key of the dictionary
set(['Jack London'])
You should use itertools.groupby, since your list is already sorted by the first column
(groupby only merges consecutive rows that share the same key).
rows = [('1', '2'),
        ('Jack London', 'Son of the Wolf'),
        ('Jack London', 'Chris Farrington'),
        ('Jack London', 'The God of His Fathers'),
        ('Jack London', 'Children of the Frost'),
        ('William Shakespeare', 'Venus and Adonis'),
        ('William Shakespeare', 'The Rape of Lucrece'),
        ('Oscar Wilde', 'Ravenna'),
        ('Oscar Wilde', 'Poems')]
# I'm not sure how you load your data, but this is the structure you should end up with
from itertools import groupby
from operator import itemgetter
grouped = groupby(rows, itemgetter(0))
result = {group: ', '.join(value[1] for value in values) for group, values in grouped}
This gives you a result of:
In [1]: pprint(result)
{'1': '2',
'Jack London': 'Son of the Wolf, Chris Farrington, The God of His Fathers, '
'Children of the Frost',
'Oscar Wilde': 'Ravenna, Poems',
'William Shakespeare': 'Venus and Adonis, The Rape of Lucrece'}
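If the rows might not arrive sorted by author, groupby would split the same author into several groups. A minimal sketch of an alternative, assuming the same (author, title) tuples as above but in arbitrary order, is to collect the titles in a collections.defaultdict. The names titles_by_author and merged are only illustrative. This also sidesteps the problem in the question: a dict literal that repeats a key keeps only the last value for it, so the titles have to be accumulated in a list instead.

```python
from collections import defaultdict

# Illustrative rows, deliberately not sorted by author
rows = [('Jack London', 'Son of the Wolf'),
        ('Oscar Wilde', 'Ravenna'),
        ('Jack London', 'Chris Farrington'),
        ('Oscar Wilde', 'Poems')]

# Map each author to the list of titles seen so far
titles_by_author = defaultdict(list)
for author, title in rows:
    titles_by_author[author].append(title)

# Join each author's titles into a single comma-separated string
merged = {author: ', '.join(titles)
          for author, titles in titles_by_author.items()}
```

Unlike groupby, this makes a single pass without requiring sorted input, at the cost of holding all titles in memory at once.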