Aggregate columns with same date (sum)


I need to aggregate rows that share the same date.

My code, as of now, returns the following:

                      date  value  source
0  2018-04-08 15:52:26.110      1  ANAPRO
1  2018-04-22 12:14:38.807      1  ANAPRO
2  2018-04-22 12:34:18.403      1  ANAPRO
3  2018-04-22 12:40:35.877      1  ANAPRO
4  2018-04-22 12:53:57.897      1  ANAPRO
5  2018-04-22 13:02:45.180      1  ANAPRO
6  2018-05-04 17:41:15.840      1  ANAPRO
7  2018-04-22 15:03:54.353      1  ANAPRO
8  2018-04-22 15:24:27.030      1  ANAPRO
9  2018-04-22 15:27:56.813      1  ANAPRO

I don't think I can aggregate the rows while HH:MM:SS.ms is shown alongside the date (I only need the date).

I've tried this:

df['date'] = pandas.to_datetime(df['date'], format='%b %d %Y.%f').astype(str)

But to no avail; I still got the same output.

The code is:

Reads my Excel file (user input).

df = pandas.read_excel(var + '.xlsx')

Selects the column I need and creates a new .xlsx to contain it.

df = df.iloc[:, 36].to_excel(var + '_.xlsx', index=False)

Opens the new .xlsx file.

df = pandas.read_excel(var + '_.xlsx')

Renames the column

df = df.rename(columns={'Prospect Dt. Cadastro': 'date'})

Adds the other columns I need.

df['value'] = 1
df['source'] = 'ANAPRO'

Tries to format the date.

df['date'] = pandas.to_datetime(df['date'], format='%b %d %Y.%f').astype(str)

Creates the final xlsx, with all the formatted data.

df = df.to_excel('payload.xlsx')

Reads the final xlsx.

df = pandas.read_excel('payload.xlsx', names=['date', 'value', 'source'])

Prints the first 10 rows.

print(df.head(10))

I'm new to Python, so sorry if I'm doing something awkward. Thank you!


Solution

  • IIUC, you might want pandas.Series.dt.date:

    df['date'] = pandas.to_datetime(df['date']).dt.date
    
    >>> df
             date  value  source
    0  2018-04-08      1  ANAPRO
    1  2018-04-22      1  ANAPRO
    2  2018-04-22      1  ANAPRO
    3  2018-04-22      1  ANAPRO
    4  2018-04-22      1  ANAPRO
    5  2018-04-22      1  ANAPRO
    6  2018-05-04      1  ANAPRO
    7  2018-04-22      1  ANAPRO
    8  2018-04-22      1  ANAPRO
    9  2018-04-22      1  ANAPRO
    

    Or, if your goal is aggregation using groupby, you can retain all the information in your original date column, and group by only the date as such:

    df['date'] = pandas.to_datetime(df['date'])
    
    df.groupby(df['date'].dt.date)
    # for example, to get the sum of the numeric 'value' column each day:
    # df.groupby(df['date'].dt.date)[['value']].sum()
    
    #             value
    # date             
    # 2018-04-08      1
    # 2018-04-22      8
    # 2018-05-04      1
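
To make the groupby approach concrete, here is a minimal, self-contained sketch that rebuilds the ten rows from the question and sums `value` per calendar day. It is only an illustration under some assumptions: the sample frame is typed in by hand rather than read from Excel, `daily` is a made-up name, and `dt.normalize()` (which keeps a datetime64 column set to midnight) is swapped in for `dt.date` so the result stays a regular date column. Grouping by `source` as well is just one possible way to keep that column in the result.

```python
import pandas

# Sample rows copied by hand from the question's printed output.
df = pandas.DataFrame({
    "date": [
        "2018-04-08 15:52:26.110",
        "2018-04-22 12:14:38.807",
        "2018-04-22 12:34:18.403",
        "2018-04-22 12:40:35.877",
        "2018-04-22 12:53:57.897",
        "2018-04-22 13:02:45.180",
        "2018-05-04 17:41:15.840",
        "2018-04-22 15:03:54.353",
        "2018-04-22 15:24:27.030",
        "2018-04-22 15:27:56.813",
    ],
})
df["date"] = pandas.to_datetime(df["date"])
df["value"] = 1
df["source"] = "ANAPRO"

# Drop the time of day, then sum 'value' for each (date, source) pair.
daily = (
    df.assign(date=df["date"].dt.normalize())
      .groupby(["date", "source"], as_index=False)["value"]
      .sum()
)
print(daily)
```

With the sample data this should give three rows, with sums 1, 8 and 1, matching the table above. From there, something like `daily.to_excel('payload.xlsx', index=False)` would write the aggregated rows out, if that is the end goal.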
    

    Or, using pandas.Grouper:

    df['date'] = pandas.to_datetime(df['date'])
    
    df.groupby(pandas.Grouper(key='date', freq='D'))
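
One difference between the two groupby variants is worth seeing in a small example: grouping by `dt.date` only produces groups for dates that occur in the data, while a daily `Grouper` bins every calendar day between the first and last timestamp, so days without rows show up with a sum of 0. The sketch below uses a shortened, hand-typed subset of the question's rows; the names `by_date`, `by_grouper` and `non_empty` are made up for illustration.

```python
import pandas

# A shortened, hand-typed subset of the question's data.
df = pandas.DataFrame({
    "date": pandas.to_datetime([
        "2018-04-08 15:52:26.110",
        "2018-04-22 12:14:38.807",
        "2018-04-22 12:34:18.403",
        "2018-05-04 17:41:15.840",
    ]),
    "value": 1,
})

# One group per date that actually occurs in the data.
by_date = df.groupby(df["date"].dt.date)["value"].sum()

# One bin per calendar day from the first to the last date, including empty days.
by_grouper = df.groupby(pandas.Grouper(key="date", freq="D"))["value"].sum()

# Filtering out the empty bins recovers the same sums as by_date.
non_empty = by_grouper[by_grouper > 0]

print(len(by_date), len(by_grouper), len(non_empty))
```

If the zero-filled days are unwanted, filtering as in `non_empty` is one option; grouping by `dt.date` avoids them in the first place.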