Search code examples
pythonpython-3.xexcelextractopenpyxl

Extract continuous cell values from multiple excel files using Python


The aim of my task is firstly to extract values from continuous cells of a single excel file. Then the same extraction method will be performed on the remaining excel files of the same folder until the loop ends

For example,
I want to extract values from row 'A283:A9000' at excel file 1. After the extraction at excel file 1 is finished, the value from row 'A283:A9000' at excel file 2 will be extracted, then the extraction at the same rows 'A283:A9000' will be continued on excel file 3, excel file 4, excel files 5 and so on.

I learn how to extract values from multiple excel files from https://www.youtube.com/watch?v=M7YkQpcB4fg The code works well when the values from non-continuous cells are extracted. However, when I try to use the code to extract value from continuous cells ('A283:A9000') of the same sheet, the code fails.

I know the problem occurs when I try to use the code to extract values from continuous cells of the same sheet, but I am not sure how to fix the code to custom with my case. I think the problem is located at the line (cells = ['C11', 'C15', 'D15', 'C16', 'A283:A9000']).

Could anyone give me help?

Cheers

Here is the code that I have tried.

import os
import openpyxl 

folder = r'C:\PhD study\GIS\Wind_Downscale\test_one'
output_file = 'C:\PhD study\GIS\Wind_Downscale\Wind_data_forecast_time.xlsx'

output_wb = openpyxl.Workbook()
output_sheet = output_wb.active
output_sheet.title = 'Wind Data for Forecast Time'

cells = ['C11', 'C15', 'D15', 'C16', 'A283:A9000']

for filename in os.listdir(folder):
    if filename.endswith('.xlsx'):
        
        file = os.path.join(folder, filename)
        
        workbook = openpyxl.load_workbook(file)
       
   
        values = [workbook.active[cell].value for cell in cells]

        
        output_sheet.append(values)
        
output_wb.save(output_file)

Here is the error messsage:

Traceback (most recent call last):

File C:\\Conda5\\lib\\site-packages\\spyder_kernels\\py3compat.py:356 in compat_exec
exec(code, globals, locals)

File c:\\users\\kxz237.spyder-py3\\temp.py:29
values = \[workbook.active\[cell\].value for cell in cells\]

File c:\\users\\kxz237.spyder-py3\\temp.py:29 in \<listcomp\>
values = \[workbook.active\[cell\].value for cell in cells\]

AttributeError: 'tuple' object has no attribute 'value'

Solution

  • Yes you are trying to use 'A283:A9000' as a single cells' co-ordinate hence the attribute error.

    An alternative is you can treat every element of your 'cells' list as a range
    cells = ['C11', 'C15', 'D15', 'C16', 'A283:A9000']
    so for each element the code extracts all the cells that the range covers;
    for 'C11' that would be just 'C11'
    for 'A283:A9000' that would be 'A283', 'A284', 'A285', 'A286', ...

    Use the Openpyxl util openpyxl.utils.cell.cols_from_range(<cells element>) on each element in the cells list.

    import os
    import openpyxl 
    
    folder = r'C:\PhD study\GIS\Wind_Downscale\test_one'
    output_file = 'C:\PhD study\GIS\Wind_Downscale\Wind_data_forecast_time.xlsx'
    
    output_wb = openpyxl.Workbook()
    output_sheet = output_wb.active
    output_sheet.title = 'Wind Data for Forecast Time'
    
    cells = ['C11', 'C15', 'D15', 'C16', 'A283:A9000']
    
    for filename in os.listdir(folder):
        if filename.endswith('.xlsx'):
            file = os.path.join(folder, filename)
    
            workbook = openpyxl.load_workbook(file)
    
            #values = [workbook.active[cell].value for cell in cells]
    
            for rng in cells:  # Each element in 'cells' list 
                ### Get all cells in the elements range
                for allcells in openpyxl.utils.cell.cols_from_range(rng):
                    ### allcells is a tuple of all individual cells in the range rng
                    for cell in allcells:  # Extract each cell
                        values = workbook.active[cell].value
    
                        output_sheet.append([values])
    
    output_wb.save(output_file)
    

    Additional details FYI
    There are two Openpyxl utilities that will return the individual cells of a range
    openpyxl.utils.cell.cols_from_range(range_string) and
    openpyxl.utils.cell.rows_from_range(range_string)

    Either could be used in this scenario given the range provided is just one column.
    However if your range covered two or more columns then the way each would return the individual cells is;
    cols_from_range, cell from each row down the first column, then same down next column etc
    rows_from_range, cells across all columns in first row, then all cells in all columns in the second row etc.

    i.e. a range 'C3:D4' would return
    cols_from_range, 'C3', 'C4', 'D3', 'D4'
    rows_from_range, 'C3', 'D3', 'C4', 'D4'