Search code examples
pythonopenpyxlxlsx

Openpyxl optimization to test every cell of 2 workbooks


I try to compare 2 xlsx files (i want to print an error if 2 cells are different ( in value or color, that's why i decided to try Openpyxl instead of pandas or something else).

I managed to have a script that does the job, but i'm not familiar with Python/this package / programming in general, and i would like a review of this code.

from openpyxl import load_workbook 

#i made 2 lists of files paths, i want to iterate on those lists to compare path1 vs path 3 and path2 v path 4 etc
FileList1 = [path1, path2...]                  
FileList2 = [path3, path4....]


#zip allows me to "open" 2 files in parallel and to iterate 2 by 2
for a, b in zip(FileList1, FileList2): 
    #Opening the 2 wb to compare (i can't use the read_only option, i don't know why...)                                                                          
    wb1 = load_workbook(a,  data_only = True )                                                                            
    wb2 = load_workbook(b,  data_only = True )     
   
 #sheetnames comparison to check if the files have the same structure, if not, next couple of files 

    if wb1.sheetnames == wb2.sheetnames :
        #sheetnames are the same, i get the ws names in a list
        WsList = wb1.sheetnames  

        #for each worksheet of my list
        for ws in WsList:  
            #Verification of the structure of the sheet --> if max rows and max columns are the same    
            if wb1[ws].max_row == wb2[ws].max_row and wb1[ws].max_column == wb2[ws].max_column:                            

                LastRow = wb1[ws].max_row + 1
                LastColumn = wb1[ws].max_column + 1
            
                #This is the part that i don't like, it's very "VBA". For each column and for each row, i check in the 2 files if the values and colors are the same
                for y in range(1,LastColumn):                                                                                    
                    for x in range (1,LastRow) :  
                        
                        #The .cell(row=x, column=y) is supposedly slow, but i don't know any other way

                        Value1  =  wb1[ws].cell(row=x, column=y).value   
                        Value2  =  wb2[ws].cell(row=x, column=y).value 
                        Color1 =  wb1[ws].cell(row=x, column=y).fill.fgColor 
                        Color2 =  wb2[ws].cell(row=x, column=y).fill.fgColor 

                        if Value1 != Value2 : 
                            print("Value Error")

                        if Color1 != Color2 :     
                            print("Color Error")          
   
            else :
                
                print("Structural error")
                                                                                 
    else :
        print("Structural error")

For 2 files with 100k cells to compare, this script takes 5 or 6 seconds to run on my laptop.

I understand that .xslx are complex and that i can't expect to have .csv speed but i think that this code is not very "pythonic".

I tried to use iter_rows and iter_columns but i did not get the results i expected.

Can anyone give me some feedback on this one please ?

Tested with Python 3.8.6 and openpyxl version: 3.0.5

Edit after Charlie Clark's answer : Here is the edited code with the "read_only" option enabled and another way to iterate over rows, columns, and worksheets :

from openpyxl import load_workbook 

#i made 2 lists of files paths, i want to iterate on those lists to compare path1 vs path 3 and path2 v path 4 etc
FileList1 =  [filepath1, filepath2...]                
FileList2 = [filepath3, filepath4...]

#zip allows me to "open" 2 files in parallel and to iterate 2 by 2
for a, b in zip(FileList1, FileList2): 
    #Opening the 2 wb to compare (i can't use the read_only option, i don't know why...)                                                                          
    wb1 = load_workbook(a,  data_only = True, read_only=True )  
    wb2 = load_workbook(b,  data_only = True, read_only=True )      
   
 #sheetnames comparison to check if the files have the same structure, if not, next couple of files 
    if wb1.sheetnames == wb2.sheetnames :
        WsList = wb1.sheetnames
    
        #for each worksheet 
        for ws in WsList:
           
           #Verification of the structure of the sheet --> if max rows and max columns are the same    
            if wb1[ws].max_row == wb2[ws].max_row and wb1[ws].max_column == wb2[ws].max_column: 

                ws1 = wb1[ws]
                ws2 = wb2[ws]
                
                for row1, row2 in zip(ws1, ws2):
                    for c1, c2 in zip(row1, row2): 

                        #i had an issue with some EmptyCells that i couldn't resolve otherwise
                        if c1.value != None: 

                            Value1 = c1.value
                            Value2 = c2.value

                            Color1 = c1.fill.fgColor.rgb
                            Color2 = c2.fill.fgColor.rgb

                            if not (Value1 == Value2):
                                print('Value error')     
                            
                            if not (Color1 == Color2 ):
                                print("color error")     
            else :
                print(" ws structural error")                                                                    
    else :
        print(" wb structural error")                
    
  
           

For my files i got a 2x speed boost (a big part of the gain is probably because of the read_only option)


Solution

  • It's always difficult to say much about performance without knowing the file, but 6 seconds seems fine to me, assuming this includes the time to load the library and read the files, which could easily be most of the time.

    You might get some improvement using read-only mode, though style access is slightly different there.

           for row1, row2 in zip(ws1, ws2):
              for c1, c2 in zip(row1, row2)
    

    You should zip through the worksheets of the different workbooks and then zip through the rows and the check if not (c1.value == c2.value and c1.fill.fgColor == c2.fillfgColor) because the value comparison will be faster. If you know what you're doing, comparing the relevant StyleArray will be faster than using style objects, but I wouldn't expect that to be the limiting factor here.

    If in doubt, using cProfile on your code.