python excel openpyxl

How can I get one row from Excel file and add it to other rows in a particular place?


I'm learning to work with openpyxl and need to read particular cells and rearrange them into a JSON string. I've tried several approaches and many of them work, but with large Excel files the scripts get really long, so I'd like something shorter. I found a nice short script that works fine, but I need to add one more row to my result, and that row has to go in a particular place, separate from the other data. I can't figure out how to put it in the right place.

Here's my Excel file:

[Image: excelfile]

I need the values from the first column (except the words "Place" and "All"), skipping every row and column that belongs to "Year", plus all columns from "1 part", "3 part" and "4 part", skipping "2 part". I don't need row № 4 at all, but I do need the information from row № 3.

Result I'm trying to get:

"1 part": "1_Pl": 4, 5, 6, "3 part": "1_Pl": 10, 11, 12, "4_Part": "1_Pl": 13, 14, 15, "1 part": "2_Pl": 19, 20, 21, "3 part": "2_Pl": 25, 26, 27

And so on.

Here's my code:

import openpyxl

START_ROW = 5 # First row with data.
END_ROW = 15 # End of data (row after the last).
TITLE_COL = 1 # Column with title
START_COL = 5 # First column where quarter data starts.
COLS_PER_QUARTER = 3 # Columns per quarter.
QUARTERS_COUNT = 4 # Quarters per year.

wb = openpyxl.load_workbook(
    filename='C:\\directory\\input.xlsx',
    read_only=True,
    data_only=True
)
ws = wb.active

entries = []
for row in range(START_ROW, END_ROW):
    def cell(col):
        return ws.cell(row, col).value

    title = cell(TITLE_COL)
    for quarter in range(QUARTERS_COUNT):
        quarter_col = START_COL + quarter * COLS_PER_QUARTER
        entry = f'"{title}": {cell(quarter_col)}, {cell(quarter_col + 1)}, {cell(quarter_col + 2)}'
        entries.append(entry)

I need to take row № 3 and place its values before the other cells, as shown in my result. I also need to drop the columns that belong to "2 part". Can anyone tell me how to do this? It's probably easy, but I just can't get it. I know pandas would be better suited for this, but I want to use openpyxl.


Solution

  • Does this work for you?

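    entries = dict()
    # iter_rows treats max_row as inclusive, while END_ROW is the row after the last data row.
    for row in ws.iter_rows(min_row=START_ROW, max_row=END_ROW - 1):
        row_data = dict()
        for quarter in (1, 3, 4):  # skip "2 part"
            # 1-based column where this part starts
            col_idx = START_COL + COLS_PER_QUARTER * (quarter - 1)
            # row is a 0-indexed tuple of cells, hence the "- 1"
            row_data[f'{quarter} part'] = [row[col_idx + i - 1].value
                                           for i in range(COLS_PER_QUARTER)]
        entries[row[TITLE_COL - 1].value] = row_data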

    Subtracting one is necessary because iter_rows returns tuples of cells (0-indexed, of course), whereas worksheet.cell() and similar functions use 1-based indexing. That's something to always look out for when using this library.
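
To see the offset in isolation, here's a minimal sketch that doesn't need a workbook. A plain tuple stands in for one row as `ws.iter_rows(values_only=True)` would yield it. The sample values are made up to match the numbers in the question, and `part_values` is a hypothetical helper, not part of openpyxl.

```python
START_COL = 5          # 1-based column where "1 part" starts (from the question)
COLS_PER_QUARTER = 3   # columns per part (from the question)


def part_values(row, quarter):
    """Return the values of one part from a 0-indexed row tuple."""
    col = START_COL + COLS_PER_QUARTER * (quarter - 1)  # 1-based, as ws.cell() counts
    idx = col - 1                                       # 0-based, as tuple indexing counts
    return list(row[idx:idx + COLS_PER_QUARTER])


# Made-up row: the title in column 1, then the values 1..15 in columns 2..16.
sample_row = ("1_Pl",) + tuple(range(1, 16))

print(part_values(sample_row, 1))
print(part_values(sample_row, 3))
print(part_values(sample_row, 4))
```

Forgetting the `- 1` wouldn't raise an error here; it would quietly shift every part one column to the right, which is why this is easy to miss.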

    The resulting entries dict is nested, à la JSON:

    {
      "1_PL": {
        "1 part": [4.0, 5.0, 6.0],
        "3 part": [10.0, 11.0, 12.0], 
        "4 part": [13.0, 14.0, 15.0]
      },
      "2_PL": {
        "1 part": [19.0, 20.0, 21.0],
         ...
      },
    ...
    }
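
If you'd rather have the flat string from the question, with the part names read from row № 3 instead of being hard-coded, something along these lines might work. It's only a sketch: it assumes each row arrives as a tuple of plain values (for example from `ws.iter_rows(values_only=True)`), and that the part name sits in the first column of each part in the header row, which is what I'd expect for merged header cells. `build_flat_string`, `header_row` and `data_rows` are hypothetical names, and the sample data is made up to mirror the numbers in the question.

```python
START_COL = 5            # 1-based column where "1 part" starts (from the question)
COLS_PER_QUARTER = 3     # columns per part (from the question)
WANTED_PARTS = (1, 3, 4)  # skip "2 part"


def build_flat_string(header_row, data_rows):
    """Join '"<part>": "<title>": a, b, c' pieces for every wanted part of every row."""
    pieces = []
    for row in data_rows:
        title = row[0]  # first column holds the row title
        for part in WANTED_PARTS:
            # 0-based index of the first column of this part
            idx = START_COL + COLS_PER_QUARTER * (part - 1) - 1
            name = header_row[idx]  # part name taken from the header row
            values = ", ".join(str(v) for v in row[idx:idx + COLS_PER_QUARTER])
            pieces.append(f'"{name}": "{title}": {values}')
    return ", ".join(pieces)


# Made-up stand-ins for row 3 (headers) and two data rows.
header_row = ("Place", "Year", None, None,
              "1 part", None, None,
              "2 part", None, None,
              "3 part", None, None,
              "4 part", None, None)
data_rows = [
    ("1_Pl",) + tuple(range(1, 16)),
    ("2_Pl",) + tuple(range(16, 31)),
]

result = build_flat_string(header_row, data_rows)
print(result)
```

With a real workbook you would pass the row 3 tuple as `header_row` and the data rows as `data_rows`; the string-building part stays the same.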