I have the following xlsx table as input, and i want to obtain the output with pandas. Right now i am using openpyxl but the file is 8k rows and 200+ col, so the code i am using is not efficient and takes more than 20minutes to run. The file as shown in the exampla has also NaN and empty cells, i want to modify only the cells that are not empty. As shown in the code this block of code must work from column 2 and from row 5 to the end of the file.
# input
Main_col 0 1 2
0 cas1 1_05.04.2024 16:40 A B
1 cas2 5_05.04.2024 16:41 C
2 cas3 4_05.04.2024 17:30 D E
# output
Main_col 0 1 2
0 cas1 1_05.04.2024 16:40 A_05.04.2024 16:40.cas1 1 B_05.04.2024 16:40.cas1 1
1 cas2 5_05.04.2024 16:41 C_05.04.2024 16:41.cas2 5
2 cas3 4_05.04.2024 17:30 D_05.04.2024 17:30.cas3 4 E_05.04.2024 17:30.cas3 4
the code i am using is the following.
for colonn in range(2,ws.max_column+1):
#print("Elaboro colonna: " + str(colonn))
for rig in range(5,ws.max_row+1):
ValoreCell = str(ws.cell(rig,colonn).value)
Valoreheader = str(ws.cell(rig,1).value)
if ValoreCell != None and ValoreCell != " ":
if "_" in ValoreCell:
Valoreheader = Valoreheader.split("_")[0]
#print("Valoreheader " + str(Valoreheader))
Valor = ws.cell(rig,colonn).value
ws.cell(rig,colonn).value = str(Valor) + "." + str(Valoreheader)
i am new in stackoverflow
a major improve in performance using pandas
You can build a mask, rework the Main_col string with str.replace
and use boolean indexing with add
to concatenate the strings:
# identify NaN/empty cells
mask = df.fillna('').ne('')
# exclude Main_col from mask
mask['Main_col'] = False
# reorder the "casX n_YYY" into "_YYY.casX n"
s = df['Main_col'].str.replace(r'([^ ]+) (\d)(_.*)', r'\3.\1 \2', regex=True)
# concatenate strings
df[mask] = df.astype(str).add(s, axis=0)
Output:
Main_col 0 1 2
0 cas1 1_05.04.2024 16:40 A_05.04.2024 16:40.cas1 1 B_05.04.2024 16:40.cas1 1
1 cas2 5_05.04.2024 16:41 C_05.04.2024 16:41.cas2 5
2 cas3 4_05.04.2024 17:30 D_05.04.2024 17:30.cas3 4 E_05.04.2024 17:30.cas3 4
on a 8k rows x 200 columns input:
374 ms ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
N
rows:Just add one step to modify the mask
# exclude first N rows
N = 3
mask.iloc[:N] = False
Example output with N=3
and an input of 6 rows:
Main_col 0 1 2
0 cas1 1_05.04.2024 16:40 A B
1 cas2 5_05.04.2024 16:41 C
2 cas3 4_05.04.2024 17:30 D E
3 cas1 1_05.04.2024 16:40 A_05.04.2024 16:40.cas1 1 B_05.04.2024 16:40.cas1 1
4 cas2 5_05.04.2024 16:41 C_05.04.2024 16:41.cas2 5
5 cas3 4_05.04.2024 17:30 D_05.04.2024 17:30.cas3 4 E_05.04.2024 17:30.cas3 4