Essentially, what I described in the title.
I am trying to combine two dataframes (i.e. df1
& df2
) where they have different amounts of columns (df1=3
, df2=8
) with varying row lengths. (The varying row lengths stem from me having a script that breaks main two excel lists into blocks based on a date condition).
My goal is to combine the two length-varying dataframes into one dataframe, where they both start at index 0
instead of one after the other.
What is currently happening:
A B C D
0 1 2 nan nan
1 3 4 nan nan
2 nan nan 5 6
3 nan nan 7 8
4 nan nan 9 10
This is how I would like it to be:
A B C D
0 1 2 5 6
1 3 4 7 8
2 nan nan 9 10
I tried many things, but this is the last code that worked (but with wrong results):
import pandas as pd
hours_df = pd.read_excel("hours.xlsx").fillna("")
hours_columns = hours_df.columns
material_df = pd.read_excel("material.xlsx").fillna("")
material_df = material_df.rename(columns={'Date': 'Material Date'})
material_columns = material_df.columns
breaker = False
temp = []
combined_df = pd.DataFrame()
last_date = "1999-01-01"
for _, row in hours_df.iterrows():
if row["Date"] != "":
block_df = pd.DataFrame(temp, columns=hours_columns)
if temp:
cell_a1 = block_df.iloc[0,0]
filtered_df = material_df.loc[
(material_df["Material Date"] < cell_a1) &
(material_df["Material Date"] >= last_date)]
last_date = cell_a1
combined_block = pd.concat([block_df, filtered_df], axis=1)
combined_df = pd.concat([combined_df, combined_block], ignore_index=True)
temp = []
temp.append(row)
if temp:
block_df = pd.DataFrame(temp, columns=hours_columns)
combined_df = pd.concat([combined_df, block_df], ignore_index=True)
print(combined_df)
I am not getting any errors. Just stacked output -- like the one I showed above.
Your issue arises because you are concatenating dataframes vertically rather than horizontally. To achieve the desired output, you need to align rows from df1
and df2
with the same index and then concatenate horizontally.
Here’s the updated code that would produce the output you want. I have added comments on the places where I've made the changes.
import pandas as pd
# Loading dataframes
hours_df = pd.read_excel("hours.xlsx").fillna("")
material_df = pd.read_excel("material.xlsx").fillna("")
material_df = material_df.rename(columns={'Date': 'Material Date'})
temp = []
combined_df = pd.DataFrame()
last_date = "1999-01-01"
for _, row in hours_df.iterrows():
if row["Date"] != "":
block_df = pd.DataFrame(temp, columns=hours_df.columns)
if temp:
# Filter material_df based on the date range
first_date_in_block = block_df.iloc[0, 0]
filtered_df = material_df.loc[
(material_df["Material Date"] < first_date_in_block) &
(material_df["Material Date"] >= last_date)
]
last_date = first_date_in_block
# Reset indices for horizontal alignment
block_df.reset_index(drop=True, inplace=True)
filtered_df.reset_index(drop=True, inplace=True)
# Concatenate horizontally
combined_block = pd.concat([block_df, filtered_df], axis=1)
combined_df = pd.concat([combined_df, combined_block], ignore_index=True)
temp = []
temp.append(row)
# Handling the remaining block
if temp:
block_df = pd.DataFrame(temp, columns=hours_df.columns)
combined_df = pd.concat([combined_df, block_df], ignore_index=True)
print(combined_df)