Search code examples
pythonexcelpandasdataframe

How to get a DataFrame, separate it in to smaller segments and write those into a excel workbook in different worksheets


I have this big data base that I need to separate into diferent segments (diferent "valid until dates") and then get each o the segments into a excel workbook. Each of the segments also should have its own excelsheet.

Df Exemple:

            A                   valid until                       C                D 
Date                                                                        
01/08/2023  Tesouro Educa+      15/12/2030              5,06             5,18   
02/08/2023  Tesouro Educa+      15/12/2030              5,05             5,17   
03/08/2023  Tesouro Educa+      15/12/2030              5,02             5,14   
04/08/2023  Tesouro IPCA+       15/12/2030              5,02             5,14   
07/08/2023  Tesouro IPCA+       15/12/2030              4,98             5,10   
...
01/08/2023  Tesouro Selic       15/12/2031              5,05             5,17   
02/08/2023  Tesouro Selic       15/12/2031              5,03             5,15   
03/08/2023  Tesouro Educa+      15/12/2031              5,00             5,12   
04/08/2023  Tesouro Educa+      15/12/2031              5,02             5,14   
07/08/2023  Tesouro Educa+      15/12/2031              4,99             5,11  

What i tried:

df = pd.read_csv(r'PATH', sep=';')
df.sort_values(by=['valid until','Date'], key=lambda x: np.argsort(natsorted(dados['Date'])))

valid_until_list=dados['valid until'].unique()

for i in valid_until_list:
    df2=df[(df['Date']==f'{i}')]
    df2=df2.set_index(['Data Base'])

    with pd.ExcelWriter(r'NEW_PATH',engine='openpyxl', if_sheet_exists='replace', mode='a') as writer:
        i=i.replace('/','_')
        
# Here i tried changing the name of the worksheets:
        sheet_title=''
        if df2['A']=='Tesouro IPCA+ com Juros Semestrais':
            sheet_title='IPCA+ Juros'
        elif nova_df2['A']=='Tesouro IPCA+':
            sheet_title='IPCA+'
        else:
            sheet_title=''


        df2.to_excel(writer, sheet_name=f'{sheet_title}_{i}')`

The code did work until i tried to change de worksheet title, now is showing this: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I'm here to ask 2 things. My df is kinda big (145712 rows), so de code is taking way to long to run. Is there a faster way to do this, maybe without using for and if loops? The second would be: how do i fix this title Error? It is not necessary, but would make my life soooo much easier :))


Solution

  • You can use groupbyto form the DFs and write them directly to sheets; the code below defines groups according to 'valid until' and then sorts each group (ie sub-DF) according to the Date column.

    groupings = df.groupby('valid until')
    
    count = 1
    with pd.ExcelWriter("test.xlsx", mode = 'w') as writer:
        for g in groupings:
            sheetname = 'sheet'+str(count)
            g[1].sort_values(by = 'Date').to_excel(writer, sheet_name=sheetname, index=False)
            count += 1
    

    This gives sheet1, sheet2 etc but you can change to sheet name formation as required. The 'valid until' of each DF is g[0] in the above code if that is useful.

    If it is required to form the groups (and so sub-DFs) using both 'valid until' and 'Date' then the grouping statement would be changed to groupings = df.groupby(['valid until', 'Date']) and the sort expression then omitted from the statement in the for loop.

    The same general principle could be use to form sub-DFs using any combination of the columns.