Search code examples
pythonpandasdatatableopenpyxl

Add substring to every non null cell through pandas


I have the following xlsx table as input, and i want to obtain the output with pandas. Right now i am using openpyxl but the file is 8k rows and 200+ col, so the code i am using is not efficient and takes more than 20minutes to run. The file as shown in the exampla has also NaN and empty cells, i want to modify only the cells that are not empty. As shown in the code this block of code must work from column 2 and from row 5 to the end of the file.

# input
            Main_col        0    1    2
0  cas1 1_05.04.2024 16:40  A    B   
1  cas2 5_05.04.2024 16:41       C   
2  cas3 4_05.04.2024 17:30  D         E
# output
            Main_col                    0                          1                        2
0  cas1 1_05.04.2024 16:40  A_05.04.2024 16:40.cas1 1  B_05.04.2024 16:40.cas1 1                    
1  cas2 5_05.04.2024 16:41                             C_05.04.2024 16:41.cas2 5                    
2  cas3 4_05.04.2024 17:30  D_05.04.2024 17:30.cas3 4                            E_05.04.2024 17:30.cas3 4

the code i am using is the following.

for colonn in range(2,ws.max_column+1):
    #print("Elaboro colonna: " + str(colonn))
    for rig in range(5,ws.max_row+1):
        ValoreCell = str(ws.cell(rig,colonn).value)
        Valoreheader = str(ws.cell(rig,1).value)
        if ValoreCell != None and ValoreCell != " ":
            if "_" in ValoreCell:
                Valoreheader = Valoreheader.split("_")[0]
                #print("Valoreheader " + str(Valoreheader))
                Valor = ws.cell(rig,colonn).value
                ws.cell(rig,colonn).value = str(Valor) + "." + str(Valoreheader)

i am new in stackoverflow

a major improve in performance using pandas


Solution

  • You can build a mask, rework the Main_col string with str.replace and use boolean indexing with add to concatenate the strings:

    # identify NaN/empty cells
    mask = df.fillna('').ne('')
    # exclude Main_col from mask
    mask['Main_col'] = False
    
    # reorder the "casX n_YYY" into "_YYY.casX n"
    s = df['Main_col'].str.replace(r'([^ ]+) (\d)(_.*)', r'\3.\1 \2', regex=True)
    
    # concatenate strings
    df[mask] = df.astype(str).add(s, axis=0)
    

    Output:

                      Main_col                          0                          1                          2
    0  cas1 1_05.04.2024 16:40  A_05.04.2024 16:40.cas1 1  B_05.04.2024 16:40.cas1 1                           
    1  cas2 5_05.04.2024 16:41                             C_05.04.2024 16:41.cas2 5                           
    2  cas3 4_05.04.2024 17:30  D_05.04.2024 17:30.cas3 4                             E_05.04.2024 17:30.cas3 4
    

    timing

    on a 8k rows x 200 columns input:

    374 ms ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    excluding the first N rows:

    Just add one step to modify the mask

    # exclude first N rows
    N = 3
    mask.iloc[:N] = False
    

    Example output with N=3 and an input of 6 rows:

                      Main_col                          0                          1                          2
    0  cas1 1_05.04.2024 16:40                          A                          B                           
    1  cas2 5_05.04.2024 16:41                                                     C                           
    2  cas3 4_05.04.2024 17:30                          D                                                     E
    3  cas1 1_05.04.2024 16:40  A_05.04.2024 16:40.cas1 1  B_05.04.2024 16:40.cas1 1                           
    4  cas2 5_05.04.2024 16:41                             C_05.04.2024 16:41.cas2 5                           
    5  cas3 4_05.04.2024 17:30  D_05.04.2024 17:30.cas3 4                             E_05.04.2024 17:30.cas3 4