I tried to clean the data as tabular by using pandas and the data is an excel file. The problem is merged row cell which is a type of data. I want to transpose it as a new column to classify the data.
What I'm Facing enter image description here What I'm expected. enter image description here
What I'm Facing
Sample CD4 CD8
---------Group1-------------
Day 1 8311 17.3 6.44
Day 2 8312 13.6 3.50
Day 3 8321 19.8 5.88
---------Group2-------------
Day 1 8322 13.5 4.09
Day 2 8311 16.0 4.92
Day 3 8312 5.67 2.28
---------Group3-------------
Day 1 8321 13.0 4.34
Day 2 8322 10.6 1.95
Day 3 8312 5.67 2.28
What I'm expected.
Sample CD4 CD8 Group
Day 1 8311 17.3 6.44 Group1
Day 2 8312 13.6 3.50 Group1
Day 3 8321 19.8 5.88 Group1
Day 1 8322 13.5 4.09 Group2
Day 2 8311 16.0 4.92 Group2
Day 3 8312 5.67 2.28 Group2
Day 1 8321 13.0 4.34 Group3
Day 2 8322 10.6 1.95 Group3
Day 3 8312 5.67 2.28 Group3
Not sure if transposing is the best way to go... As you have not showed us your excel file... I used example as per image below...
I am not a pro, so my approach is not perfect. The code will read data in a small chunks per each group and combine all to main dataframe.
import pandas as pd
df = pd.read_excel('TestData.xlsx') # all data to df
# finding rows containing substring "Group" in the first column
indices = df[df['Unnamed: 0'].str.contains('Group')].index.tolist()
# creating empty dataframe for future data
masterDF = pd.DataFrame()
# looping initial dataframe
for i in range(len(indices)):
# slicing df from first indice to the last one
# or, in other words,
# reading the source excel file in chunks per each indice
try:
data = df.loc[indices[i]: indices[i + 1] - 1, :].copy()
except IndexError:
data = df.loc[indices[i]:, :].copy()
# adding new Columns for each group
data['Group'] = data.iloc[0,0]
print(data.iloc[0,0])
print(data)
print('end_______________')
# now appending each chunk of data to the main dataframe
masterDF = pd.concat([masterDF, data])
# cleaning up dataframe before spitting out to .csv
masterDF.dropna(subset=['Sample', 'CD4', 'CD8'], how='all', inplace=True) # dropping rows with no data
masterDF.rename(columns={masterDF.columns[0]: ''}, inplace=True)
# spitting out the final data
masterDF.to_csv('output.csv', index=False)
out:
Sample CD4 CD8 Group
1 Day 1 8311.0 17.30 6.44 Group1
2 Day 2 8312.0 13.60 3.50 Group1
3 Day 3 8321.0 19.80 5.88 Group1
5 Day 1 8322.0 13.50 4.09 Group2
6 Day 2 8311.0 16.00 4.92 Group2
7 Day 3 8312.0 5.67 2.28 Group2
9 Day 1 8321.0 13.00 4.34 Group3
10 Day 2 8322.0 10.60 1.95 Group3
11 Day 3 8312.0 5.67 2.28 Group3