Search code examples
pythonpandasdataframe

Combine Two Pandas Dataframes Side-By-Side With Resulting Length being `max(df1, df2)`


Essentially, what I described in the title.

I am trying to combine two dataframes (i.e. df1 & df2) where they have different amounts of columns (df1=3, df2=8) with varying row lengths. (The varying row lengths stem from me having a script that breaks main two excel lists into blocks based on a date condition).

My goal is to combine the two length-varying dataframes into one dataframe, where they both start at index 0 instead of one after the other.

What is currently happening:

     A  B  C  D
0    1  2 nan nan
1    3  4 nan nan
2  nan nan 5  6
3  nan nan 7  8
4  nan nan 9  10

This is how I would like it to be:

     A  B  C  D
0    1  2  5  6
1    3  4  7  8
2  nan nan 9  10

I tried many things, but this is the last code that worked (but with wrong results):

import pandas as pd


hours_df = pd.read_excel("hours.xlsx").fillna("")
hours_columns = hours_df.columns

material_df = pd.read_excel("material.xlsx").fillna("")
material_df = material_df.rename(columns={'Date': 'Material Date'})
material_columns = material_df.columns


breaker = False

temp = []
combined_df = pd.DataFrame()

last_date = "1999-01-01"


for _, row in hours_df.iterrows():
    if row["Date"] != "":
        block_df = pd.DataFrame(temp, columns=hours_columns)

        if temp:
            cell_a1 = block_df.iloc[0,0]

            filtered_df = material_df.loc[
                    (material_df["Material Date"] < cell_a1) &
                    (material_df["Material Date"] >= last_date)]

            last_date = cell_a1

            combined_block = pd.concat([block_df, filtered_df], axis=1)

            combined_df = pd.concat([combined_df, combined_block], ignore_index=True)

        temp = []

    temp.append(row)

if temp:
    block_df = pd.DataFrame(temp, columns=hours_columns)
    combined_df = pd.concat([combined_df, block_df], ignore_index=True)

print(combined_df)

I am not getting any errors. Just stacked output -- like the one I showed above.


Solution

  • Your issue arises because you are concatenating dataframes vertically rather than horizontally. To achieve the desired output, you need to align rows from df1 and df2 with the same index and then concatenate horizontally.

    Here’s the updated code that would produce the output you want. I have added comments on the places where I've made the changes.

    import pandas as pd
    
    # Loading dataframes
    hours_df = pd.read_excel("hours.xlsx").fillna("")
    material_df = pd.read_excel("material.xlsx").fillna("")
    material_df = material_df.rename(columns={'Date': 'Material Date'})
    
    temp = []
    combined_df = pd.DataFrame()
    last_date = "1999-01-01"
    
    for _, row in hours_df.iterrows():
        if row["Date"] != "":
            block_df = pd.DataFrame(temp, columns=hours_df.columns)
    
            if temp:
                # Filter material_df based on the date range
                first_date_in_block = block_df.iloc[0, 0]
                filtered_df = material_df.loc[
                    (material_df["Material Date"] < first_date_in_block) &
                    (material_df["Material Date"] >= last_date)
                ]
                last_date = first_date_in_block
    
                # Reset indices for horizontal alignment
                block_df.reset_index(drop=True, inplace=True)
                filtered_df.reset_index(drop=True, inplace=True)
    
                # Concatenate horizontally
                combined_block = pd.concat([block_df, filtered_df], axis=1)
                combined_df = pd.concat([combined_df, combined_block], ignore_index=True)
    
            temp = []
    
        temp.append(row)
    
    # Handling the remaining block
    if temp:
        block_df = pd.DataFrame(temp, columns=hours_df.columns)
        combined_df = pd.concat([combined_df, block_df], ignore_index=True)
    
    print(combined_df)