Search code examples
pythonpandasdataframepandas-groupby

python pandas how to edit value of duplicate?


I want to change value of my duplicate. Here is my dataframe:

  sku
FAT-001
FAT-001
FAT-001
FAT-002
FAT-002

my expected data frame will be look like this

  sku
FAT-001 #don't want to change first value of duplicate 
FAT-001-01
FAT-001-02
FAT-002
FAT-002-01

Solution

  • df["number"] = df.groupby("sku").cumcount()
    df.apply(lambda x: x.sku + ("" if x.number == 0 else "-" + str(x.number).rjust(2,"0")), axis = 1)
    

    or:

    df["number"] = "-" + df.groupby("sku").cumcount().astype(str).str.rjust(2, '0')
    df.number[df["number"].eq("-" + "0" * 2)] = ""
    df.sku + df.number 
    

    my output:

    0       FAT-001
    1    FAT-001-01
    2    FAT-001-02
    3       FAT-002
    4    FAT-002-01
    dtype: object
    

    explanation:

    what is groupby? it is a sql-inspired command that give you some element base on every unique element... for example: length, max, list or other...

    df = pd.DataFrame([
        [1,2],
        [1,3],
        [1,4],
        [2,5],
        [2,6],
    ], columns=["id","number"])
    df.groupby("id").agg({"number": len})
    

    give you:

        number
    id  
    1   3
    2   2
    

    the number of every unique elements, or

    df.groupby("id").agg({"number": list})
    

    give you

        number
    id  
    1   [2, 3, 4]
    2   [5, 6]
    

    and you can try max or min or first...

    using agg, you can specify what you want for every column... if you have more than one (except what was groupby), you can specify something for each column...

    except agg, there are other methods for goupped dataframes: like cumcount, that set index every row for each group, I mean it reset index for every group:

    df.groupby("sku").cumcount()
    

    output:

    0    0
    1    1
    2    2
    3    0
    4    1
    

    your first FAT-001 get index: 0, next: 1, ... and for FAT-002, first get index 0 again...

    so, we have two part that you want, now... therefore we must find a way to join them for every row: axis:1 in apply means for every row

    so, you have one exception: you don't want index for every row of each group... so change it to "", empty:

    df.apply(lambda x: "" if x.number == 0 else str(x.number), axis = 1)
    

    :

    0     
    1    1
    2    2
    3     
    4    1
    dtype: object
    

    your 0 and 3 rows are for new-group...

    next, your desired format is: 01,02,... a 0 for every index. pandas has a method that convert every string to string with desired length with an arbitrary char: rjust(desird_length, arbitrary_char)

    how works: if you call that as rjust(2,"0"), it dos not change "22" or other 2-char, 3-char, ... strings, but, if your string length be 1 like 1 will be converted to 01 and ... (notice that there are a method called ljust too :))

    df["number"] = df.groupby("sku").cumcount()
    df.apply(lambda x: "" if x.number == 0 else str(x.number).rjust(2,"0"), axis = 1)
    
    0      
    1    01
    2    02
    3      
    4    01
    dtype: object
    

    and if statement can written as:

    if x.number == 0:
       return ""
    else:
        return "-" + str(x.number).rjust(2,"0")
    

    and some points:

    1. what is astype(str): it convert every element to string, and works like str(x), but for every element. why? to add "-" before it and using rjust.
    2. what is eq? it is is-equal? and return True for each row if be equal to value, and False otherwise.
    3. why df.number[df["number"].eq("-" + "0" * 2)] = ""? because we convert all first elements of each group to ""
    4. why "-" + "0" * 2? because we add "-" and ljust in previous line so we must use correct value: "-00". and why "0" * 2? because you can use every number for ljust length like 10 and set it there, too