Add substring to every non null cell through pandas

I have the following xlsx table as input, and i want to obtain the output with pandas. Right now i am using openpyxl but the file is 8k rows and 200+ col, so the code i am using is not efficient and takes more than 20minutes to run. The file as shown in the exampla has also NaN and empty cells, i want to modify only the cells that are not empty. As shown in the code this block of code must work from column 2 and from row 5 to the end of the file.

# input
            Main_col        0    1    2
0  cas1 1_05.04.2024 16:40  A    B   
1  cas2 5_05.04.2024 16:41       C   
2  cas3 4_05.04.2024 17:30  D         E

# output
            Main_col                    0                          1                        2
0  cas1 1_05.04.2024 16:40  A_05.04.2024 16:40.cas1 1  B_05.04.2024 16:40.cas1 1                    
1  cas2 5_05.04.2024 16:41                             C_05.04.2024 16:41.cas2 5                    
2  cas3 4_05.04.2024 17:30  D_05.04.2024 17:30.cas3 4                            E_05.04.2024 17:30.cas3 4

the code i am using is the following.

for colonn in range(2,ws.max_column+1):
    #print("Elaboro colonna: " + str(colonn))
    for rig in range(5,ws.max_row+1):
        ValoreCell = str(ws.cell(rig,colonn).value)
        Valoreheader = str(ws.cell(rig,1).value)
        if ValoreCell != None and ValoreCell != " ":
            if "_" in ValoreCell:
                Valoreheader = Valoreheader.split("_")[0]
                #print("Valoreheader " + str(Valoreheader))
                Valor = ws.cell(rig,colonn).value
                ws.cell(rig,colonn).value = str(Valor) + "." + str(Valoreheader)

i am new in stackoverflow

a major improve in performance using pandas

Solution

You can build a mask, rework the Main_col string with str.replace and use boolean indexing with add to concatenate the strings:

# identify NaN/empty cells
mask = df.fillna('').ne('')
# exclude Main_col from mask
mask['Main_col'] = False

# reorder the "casX n_YYY" into "_YYY.casX n"
s = df['Main_col'].str.replace(r'([^ ]+) (\d)(_.*)', r'\3.\1 \2', regex=True)

# concatenate strings
df[mask] = df.astype(str).add(s, axis=0)

Output:

                  Main_col                          0                          1                          2
0  cas1 1_05.04.2024 16:40  A_05.04.2024 16:40.cas1 1  B_05.04.2024 16:40.cas1 1                           
1  cas2 5_05.04.2024 16:41                             C_05.04.2024 16:41.cas2 5                           
2  cas3 4_05.04.2024 17:30  D_05.04.2024 17:30.cas3 4                             E_05.04.2024 17:30.cas3 4

timing

on a 8k rows x 200 columns input:

374 ms ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

excluding the first `N` rows:

Just add one step to modify the mask

# exclude first N rows
N = 3
mask.iloc[:N] = False

Example output with N=3 and an input of 6 rows:

                  Main_col                          0                          1                          2
0  cas1 1_05.04.2024 16:40                          A                          B                           
1  cas2 5_05.04.2024 16:41                                                     C                           
2  cas3 4_05.04.2024 17:30                          D                                                     E
3  cas1 1_05.04.2024 16:40  A_05.04.2024 16:40.cas1 1  B_05.04.2024 16:40.cas1 1                           
4  cas2 5_05.04.2024 16:41                             C_05.04.2024 16:41.cas2 5                           
5  cas3 4_05.04.2024 17:30  D_05.04.2024 17:30.cas3 4                             E_05.04.2024 17:30.cas3 4

Add substring to every non null cell through pandas

timing

excluding the first N rows:

excluding the first `N` rows: