Search code examples
pythonpandasduplicateson-duplicate-key

Pandas: Split dataframe with duplicate values into dataframe with unique values


I have a dataframe in Pandas with duplicate values in Col1:

Col1
a
a
b
a
a
b

What I want to do is to split this df into different df-s with unique Col1 values in each.

DF1:

Col1
a
b

DF2:

Col1
a
b

DF3:

Col1
a

DF4:

Col1
a

Any suggestions ?


Solution

  • I don't think you can achieve this in a vectorial way.

    One possibility is to use a custom function to iterate the items and keep track of the unique ones. Then use this to split with groupby:

    def cum_uniq(s):
        i = 0
        seen = set()
        out = []
        for x in s:
            if x in seen:
                i+=1
                seen = set()
            out.append(i)
            seen.add(x)
        return pd.Series(out, index=s.index)
    
    out = [g for _,g in df.groupby(cum_uniq(df['Col1']))]
    

    output:

    [  Col1
     0    a,
       Col1
     1    a
     2    b,
       Col1
     3    a,
       Col1
     4    a
     5    b]
    

    intermediate:

    cum_uniq(df['Col1'])
    
    0    0
    1    1
    2    1
    3    2
    4    3
    5    3
    dtype: int64
    
    if order doesn't matter

    Let's ad a Col2 to the example:

      Col1  Col2
    0    a     0
    1    a     1
    2    b     2
    3    a     3
    4    a     4
    5    b     5
    

    the previous code gives:

    [  Col1  Col2
     0    a     0,
       Col1  Col2
     1    a     1
     2    b     2,
       Col1  Col2
     3    a     3,
       Col1  Col2
     4    a     4
     5    b     5]
    

    If order does not matter, you can vectorize it:

    out = [g for _,g in df.groupby(df.groupby('Col1').cumcount())]
    

    output:

    [  Col1  Col2
    0    a     0
    2    b     2,
       Col1  Col2
    1    a     1
    5    b     5,
       Col1  Col2
    3    a     3,
       Col1  Col2
    4    a     4]