Why doesn't itertools.groupby() work?


I've read several questions about groupby(), but I can't see what's wrong with my example:

import itertools

students = [{'name': 'Paul',    'mail': '@gmail.com'},
            {'name': 'Tom',     'mail': '@yahoo.com'},
            {'name': 'Jim',     'mail': 'gmail.com'},
            {'name': 'Jules',   'mail': '@something.com'},
            {'name': 'Gregory', 'mail': '@gmail.com'},
            {'name': 'Kathrin', 'mail': '@something.com'}]

key_func = lambda student: student['mail']

for key, group in itertools.groupby(students, key=key_func):
    print(key)
    print(list(group))

This prints each student separately. Why don't I get only 3 groups: @gmail.com, @yahoo.com and @something.com?


Solution

  • For starters, one of the mails is gmail.com (without the @) while the others are @gmail.com. These are different strings, so they are treated as separate groups.

    groupby also only groups consecutive items that share a key, so it expects the data to be pre-sorted by the same key function. That explains why you get @gmail.com and @something.com more than once.

    From the docs:

    ... Generally, the iterable needs to already be sorted on the same key function. ...

    import itertools
    
    students = [{'name': 'Paul', 'mail': '@gmail.com'}, {'name': 'Tom', 'mail': '@yahoo.com'},
                {'name': 'Jim', 'mail': 'gmail.com'}, {'name': 'Jules', 'mail': '@something.com'},
                {'name': 'Gregory', 'mail': '@gmail.com'}, {'name': 'Kathrin', 'mail': '@something.com'}]
    
    key_func = lambda student: student['mail']
    
    # sort by the same key function we later pass to groupby
    students.sort(key=key_func)
    
    for key, group in itertools.groupby(students, key=key_func):
        print(key)
        print(list(group))
    
    #  @gmail.com
    #  [{'name': 'Paul', 'mail': '@gmail.com'}, {'name': 'Gregory', 'mail': '@gmail.com'}]
    #  @something.com
    #  [{'name': 'Jules', 'mail': '@something.com'}, {'name': 'Kathrin', 'mail': '@something.com'}]
    #  @yahoo.com
    #  [{'name': 'Tom', 'mail': '@yahoo.com'}]
    #  gmail.com
    #  [{'name': 'Jim', 'mail': 'gmail.com'}]
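
To see why the sorting matters, it can help to look at groupby on a tiny input. The following is only a minimal sketch: the `mails` list is made up for illustration and is not the data from the question. groupby starts a new group every time the key changes, so a value that reappears later ends up in a second group unless the input is sorted first.

```python
import itertools

# Hypothetical, unsorted sample data (not taken from the question)
mails = ['@gmail.com', '@yahoo.com', '@gmail.com', '@gmail.com']

# Without sorting, only *consecutive* equal values are merged
unsorted_groups = [(key, len(list(group)))
                   for key, group in itertools.groupby(mails)]
print(unsorted_groups)
# [('@gmail.com', 1), ('@yahoo.com', 1), ('@gmail.com', 2)]

# After sorting, equal values are adjacent and form a single group
sorted_groups = [(key, len(list(group)))
                 for key, group in itertools.groupby(sorted(mails))]
print(sorted_groups)
# [('@gmail.com', 3), ('@yahoo.com', 1)]
```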
    

    After fixing both the sorting and the gmail.com/@gmail.com typo, we get the expected output:

    import itertools
    
    students = [{'name': 'Paul', 'mail': '@gmail.com'}, {'name': 'Tom', 'mail': '@yahoo.com'},
                {'name': 'Jim', 'mail': '@gmail.com'}, {'name': 'Jules', 'mail': '@something.com'},
                {'name': 'Gregory', 'mail': '@gmail.com'}, {'name': 'Kathrin', 'mail': '@something.com'}]
    
    key_func = lambda student: student['mail']
    
    students.sort(key=key_func)
    
    for key, group in itertools.groupby(students, key=key_func):
        print(key)
        print(list(group))
    
    #  @gmail.com
    #  [{'name': 'Paul', 'mail': '@gmail.com'},
    #   {'name': 'Jim', 'mail': '@gmail.com'},
    #   {'name': 'Gregory', 'mail': '@gmail.com'}]
    #  @something.com
    #  [{'name': 'Jules', 'mail': '@something.com'},
    #   {'name': 'Kathrin', 'mail': '@something.com'}]
    #  @yahoo.com
    #  [{'name': 'Tom', 'mail': '@yahoo.com'}]
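
If sorting the list is not desirable (for example because the original order matters), collecting the items into a dictionary avoids the pre-sorting requirement altogether. The sketch below is an alternative to groupby, not something the documentation quote above prescribes. `normalize_mail` is a hypothetical helper that assumes a missing leading @ is a typo, so the original data can stay untouched.

```python
from collections import defaultdict

students = [{'name': 'Paul', 'mail': '@gmail.com'}, {'name': 'Tom', 'mail': '@yahoo.com'},
            {'name': 'Jim', 'mail': 'gmail.com'}, {'name': 'Jules', 'mail': '@something.com'},
            {'name': 'Gregory', 'mail': '@gmail.com'}, {'name': 'Kathrin', 'mail': '@something.com'}]


def normalize_mail(mail):
    # Hypothetical helper: assumes a missing leading '@' is a typo
    return mail if mail.startswith('@') else '@' + mail


# Map each normalized mail to the names that use it; no sorting needed
groups = defaultdict(list)
for student in students:
    groups[normalize_mail(student['mail'])].append(student['name'])

for mail, names in groups.items():
    print(mail, names)

# @gmail.com ['Paul', 'Jim', 'Gregory']
# @yahoo.com ['Tom']
# @something.com ['Jules', 'Kathrin']
```

The groups appear in first-seen order here because dictionaries preserve insertion order in Python 3.7 and later.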