Search code examples
pythonlistpandasmergeunique

How to compile all lists in a column into one unique list


I have a pandas dataframe as below:

enter image description here

How can I combine all the lists (in the 'val' column) into a unique list (set), e.g. [val1, val2, val33, val9, val6, val7]?

I can solve this with the following code. I wonder if there is an easier way to get all unique values from a column without iterating the dataframe rows?

def_contributors=[]
for index, row in df.iterrows():
    contri = ast.literal_eval(row['val'])
    def_contributors.extend(contri)
def_contributors = list(set(def_contributors))

Solution

  • Another solution with exporting Series to nested lists and then apply set to flatten list:

    df = pd.DataFrame({'id':['a','b', 'c'], 'val':[['val1','val2'],
                                                   ['val33','val9','val6'],
                                                   ['val2','val6','val7']]})
    
    print (df)
      id                  val
    0  a         [val1, val2]
    1  b  [val33, val9, val6]
    2  c   [val2, val6, val7]
    
    print (type(df.val.ix[0]))
    <class 'list'>
    
    print (df.val.tolist())
    [['val1', 'val2'], ['val33', 'val9', 'val6'], ['val2', 'val6', 'val7']]
    
    print (list(set([a for b in df.val.tolist() for a in b])))
    ['val7', 'val1', 'val6', 'val33', 'val2', 'val9']
    

    Timings:

    df = pd.concat([df]*1000).reset_index(drop=True)
    
    In [307]: %timeit (df['val'].apply(pd.Series).stack().unique()).tolist()
    1 loop, best of 3: 410 ms per loop
    
    In [355]: %timeit (pd.Series(sum(df.val.tolist(),[])).unique().tolist())
    10 loops, best of 3: 31.9 ms per loop
    
    In [308]: %timeit np.unique(np.hstack(df.val)).tolist()
    100 loops, best of 3: 10.7 ms per loop
    
    In [309]: %timeit (list(set([a for b in df.val.tolist() for a in b])))
    1000 loops, best of 3: 558 µs per loop
    

    If types is not list but string use str.strip and str.split:

    df = pd.DataFrame({'id':['a','b', 'c'], 'val':["[val1,val2]",
                                                   "[val33,val9,val6]",
                                                   "[val2,val6,val7]"]})
    
    print (df)
      id                val
    0  a        [val1,val2]
    1  b  [val33,val9,val6]
    2  c   [val2,val6,val7]
    
    print (type(df.val.ix[0]))
    <class 'str'>
    
    print (df.val.str.strip('[]').str.split(','))
    0           [val1, val2]
    1    [val33, val9, val6]
    2     [val2, val6, val7]
    Name: val, dtype: object
    
    print (list(set([a for b in df.val.str.strip('[]').str.split(',') for a in b])))
    ['val7', 'val1', 'val6', 'val33', 'val2', 'val9']