Tags: python, numpy, python-itertools

Getting distinct values from a list of lists, each containing a comma-delimited string


Main list:

data = [
["629-2, text1, 12"],
["629-2, text2, 12"],
["407-3, text9, 6"],
["407-3, text4, 6"],
["000-5, text7, 0"],
["000-5, text6, 0"],
]

I want to get a list comprised of unique lists like so:

data_unique = [
["629-2, text1, 12"],
["407-3, text9, 6"],
["000-5, text6, 0"],
]

I've tried using numpy.unique, but I need to pare the result down further: the final list should contain only one list for each numerical designator at the beginning of the string (e.g. 629-2).

I've also tried using chain from itertools like this:

from itertools import chain

def get_unique(data):
    return list(set(chain(*data)))

But that only got me as far as numpy.unique.

Thanks in advance.


Solution

  • Code

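    from itertools import groupby

    def get_unique(data):
        def designated_version(item):
            # Leading designator of the row's single string, e.g. '629-2'
            return item[0].split(',')[0]

        # groupby only merges adjacent items, so sort by the same key first
        return [list(v)[0]
                for _, v in groupby(sorted(data, key=designated_version),
                                    designated_version)]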

    Test

    print(get_unique(data))
    # Output
    [['000-5, text7, 0'], ['407-3, text9, 6'], ['629-2, text1, 12']]
    

    Explanation

    • Sorts data by the designator, because groupby only groups adjacent items; as a side effect, the result is ordered by designator string rather than by input order
    • Uses groupby to group the items by their designator, i.e. designated_version(item), which returns item[0].split(',')[0]
    • The list comprehension keeps the first item in each group, i.e. list(v)[0]
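
If the rows need to stay in their original order, as in the question's desired output, one alternative is to skip the sort and remember the first row seen for each designator. The sketch below is not part of the answer above: the name get_unique_in_order is made up for illustration, and it relies on dicts preserving insertion order (Python 3.7+).

```python
def get_unique_in_order(data):
    """Keep the first row seen for each designator, preserving input order.

    Hypothetical helper for illustration; assumes each row is a
    one-element list holding a comma-delimited string.
    """
    first_seen = {}
    for row in data:
        # Designator is the text before the first comma, e.g. '629-2'
        key = row[0].split(',')[0].strip()
        # setdefault stores the row only if the key is not present yet
        first_seen.setdefault(key, row)
    return list(first_seen.values())


data = [
    ["629-2, text1, 12"],
    ["629-2, text2, 12"],
    ["407-3, text9, 6"],
    ["407-3, text4, 6"],
    ["000-5, text7, 0"],
    ["000-5, text6, 0"],
]

print(get_unique_in_order(data))
# [['629-2, text1, 12'], ['407-3, text9, 6'], ['000-5, text7, 0']]
```

This makes a single pass over the data and does not require it to be sorted. Like the groupby version, it keeps the first row for each designator, so 000-5 yields text7 rather than text6.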