Search code examples
pythonexcelopenpyxlwin32comxlwings

Extracting all data validation formula values(dropdown) from excel sheet


How can we extract the data validation dropdown values(not just formulas/references or a single valued formula result) from a given excel sheet using python?

Currently openpyxl helps us to get the formulas but not the values directly. It even fails in cases where file contain extensions(extLst) to the OOXML specifications that are not supported.These datavalidation formulas can contain functions like OFFSET, SUBSTITUTE, VLOOKUP, INDIRECT etc. There are some libraries that support parsing/calculation of a subset of formulas(xlcalculator, pycel, formulas etc); but it fails in some cases(limitations).Is there a way we can get them using libraries like xlwings, win32com etc or even using macros with python? The datavalidation formulas can contain plain lists(values), definednames, database table reference etc. and the source file should not be ideally modified. Is there a solution which works in all these cases?


Solution

  • In order to obtain the results of a dynamic Excel formula you need to automate Excel itself, rather than just parse the worksheet file with an XML reader.

    Hard to say this solution will fit all your cases, but it is a starting point. The Excel object model has a Validation Object, which contains all the information about the validation for a cell or range. One of the properties of the Validation might be a formula. The Worksheet/Application Evaluate method can return the result of the formula.

    As an example, cell C3 has a validation formula based on an offset from cell F3:

    enter image description here

    Using win32com, the formula for a validation range can be evaluated, as shown in the following minimal Python code:

    import win32com.client as wc
    
    #Create Excel object
    xl = wc.gencache.EnsureDispatch('Excel.Application')
    xl.Visible=True
    
    #Open saved workbook, get the first worksheet, and a given cell
    wb = xl.Workbooks.Open('MySheet.xlsx')
    ws = wb.Worksheets[1]
    rng = ws.Range('C3')
    
    #Get the Validation object for the range
    valid = rng.Validation
    formula = valid.Formula1
    
    if len(formula) >0: #If formula not empty
        print('Formula for validation of cell',rng.Address,'is',formula)
    
        #Evaluate will return different results depending on formula
        #Offset returns a r x c array of cells (Range objects)
        #Need to cut off the leading '=' character from the formula string
        res = rng.Worksheet.Evaluate(formula[1:])
        arr = [ [c.Value for c in r] for r in res]
        print('Result of validation formula:',arr)
    
    #Tidy up
    wb.Close(False)
    xl.Quit()
    

    with the following output:

    Formula for validation of cell $C$3 is =OFFSET($F$3,2,0,5)
    Result of validation formula: [[3.0], [4.0], [5.0], [6.0], [7.0]]
    

    You may need to experiment with the return values from Evaluate depending on the exact formulas you are using. Using win32com differs from file-reading methods (like openpyxl) in that the actual Excel application is launched, and its calculation engine used to return the results. This does however mean that you need the Excel application installed and accessible (which can be an issue for web/server based use cases).