Search code examples
pythonpandaspivot-tablelarge-data

Count duplicated values, delete duplicates and keep count and other columns


I'm setting up a data-set of about 10 000 rows and 55 columns from a excel file format. I pick out the relevant column to be displayed (Number and Date).

Now, the column "Number" has many duplicated values that i want to count and then remove the duplicates. In the same time i want to show the latest date the number was used.

Put in an example:

Column 1 = Numbers [445, 446, 447, 449, 445, 451, 445, 466, 449, ...]
Column 2 = Date [4/26/2019,3/26/2019,3/15/2019,2/26/2019,12/26/2018,12/16/2018,11/26/2018,11/6/2018,11/01/2019,... ]

445 and 447 is duplicated values; 445 is counted 3 times and 449 is counted 2 times at different dates.

The table i want to create is then:

Column 1 = Numbers [445, 446, 447, 449, 451, 466, ...]
Column 2 = Date [4/26/2019,3/26/2019,3/15/2019,2/26/2019,12/16/2018,11/6/2018,,...]
Column 3 = Count [3,1,1,2,1,1,...]

I.e. the date to be kept in the new table is the latest date when the number is used.

import pandas as pd

data = pd.read_excel(r'ImportedFile.xlsx', header = 0)
df = data[['Number','Date']]
sold_total = df.pivot_table(index=['Number'], aggfunc='size')

What to be next? Thanks


Solution

  • Use:

    df['Count']=df.groupby('Column_1').transform('count')
    df=df.drop_duplicates('Column_1')
    print(df)
    

       Column_1   Column_2  Count
    0       445 2019-04-26      3
    1       446 2019-03-26      1
    2       447 2019-03-15      1
    3       449 2019-02-26      2
    5       451 2018-12-16      1
    7       466 2018-11-06      1