Search code examples
pythonpandasexcelopenpyxldata-conversion

is it possible to split data while moving it from one excel file to another in python? Pandas or openpyxl


Essentially I need to translate one excel document into another.

Both sheets are formatted differently, but contain most of the same information - however in sheet 1, some data is formatted in a different way.

For Example "Name" in Sheet 1, and "First Name" "Last Name" in Sheet 2

is it possible to have my script do this for me? What about looking for seperators like a comma to split "Address" into "Street" "city" "state" "zip" -- or is that best done post-translate using Excel tools.

I've been able to read rows directly using openpyxl with code like this:

    step = 2
    read_start_row = 4
    write_start_row = 3
    amount_of_rows = 30

    for i in range(0, amount_of_rows, step):
        #copy from wb1
        c = ws1.cell(row=read_start_row+i, column=4)
        #paste in ws2
        ws2.cell(row=write_start_row+(i/step), column=4, value=c.value)

but not sure where to start when trying to also alter the data.


Solution

  • This is a quick example if just copying the data.

    The Example sheet has two columns; 'Name' and 'Address' where;

    Name includes "First" and "Last" name separated by space
    Address includes "Street" "City" "State" and "Zip" separated by comma

    enter image description here

    The example code reads the source Excel sheet splits the two columns into the constituents and writes the result back to the destinaton sheet

    import pandas as pd
    
    
    ### Read data from source sheet
    df = pd.read_excel('source.xlsx', sheet_name='Sheet1')
    
    ### Split the necessary columns on the delimiters
    df[['First', 'Last']] = df['Name'].str.split(' ', n=1, expand=True)  # Delimiter is space
    df[['Street', 'City', 'State', 'Zip']] = df['Address'].str.split(', ', n=3, expand=True)  # Delimiter is comma
    
    ### Drop the now unnecessary columns
    df = df.drop(['Name', 'Address'], axis=1)
    
    ### Reorder columns, probably not needed in this case but ensures the columns are in correct order.
    df = df[['First', 'Last', 'Street', 'City', 'State', 'Zip']] 
    
    ### Write to the destination sheet, start row 2 (startrow 1), drop index and headers
    with pd.ExcelWriter('dest.xlsx', mode='a', engine='openpyxl', if_sheet_exists='overlay') as writer:
        df.to_excel(writer, sheet_name="Sheet1", startrow=1, index=False, header=False)
    

    Resultant dataframe

       First   Last          Street     City State         Zip
    0  Mavis   West  421 E DRACHMAN   TUCSON    AZ  85705-7598
    1   John  Spurs     100 MAIN ST  SEATTLE    WA       98104
    2   Jack   East   105 KROME AVE    MIAMI    FL  33185 3700
    

    In this example the destination Sheet contains the headers already on row 1 (note this is row 0 for to_excel). For this since writing to an existing sheet which already contains data (Headers) we use 'mode=a' (append) which requires Openpyxl as the engine.
    The to_excel write excludes the Headers from the dataframe being they already exist, however if preferred the destination sheet could be empty and the headers written along with the columns.
    enter image description here