Search code examples
pythonexcelpandasdataframedata-science

Pandas: Reading Excel with merged row cells


I tried to clean the data as tabular by using pandas and the data is an excel file. The problem is merged row cell which is a type of data. I want to transpose it as a new column to classify the data.

What I'm Facing enter image description here What I'm expected. enter image description here

What I'm Facing

        Sample   CD4     CD8
---------Group1-------------           
Day 1   8311    17.3    6.44
Day 2   8312    13.6    3.50
Day 3   8321    19.8    5.88
---------Group2------------- 
Day 1   8322    13.5    4.09
Day 2   8311    16.0    4.92
Day 3   8312    5.67    2.28
---------Group3------------- 
Day 1   8321    13.0    4.34
Day 2   8322    10.6    1.95
Day 3   8312    5.67    2.28

What I'm expected.

        Sample   CD4     CD8    Group        
Day 1   8311    17.3    6.44    Group1
Day 2   8312    13.6    3.50    Group1
Day 3   8321    19.8    5.88    Group1
Day 1   8322    13.5    4.09    Group2
Day 2   8311    16.0    4.92    Group2
Day 3   8312    5.67    2.28    Group2
Day 1   8321    13.0    4.34    Group3
Day 2   8322    10.6    1.95    Group3
Day 3   8312    5.67    2.28    Group3

Solution

  • Not sure if transposing is the best way to go... As you have not showed us your excel file... I used example as per image below...

    I am not a pro, so my approach is not perfect. The code will read data in a small chunks per each group and combine all to main dataframe.

    enter image description here

    import pandas as pd
    
    df = pd.read_excel('TestData.xlsx') # all data to df
    
    # finding rows containing substring "Group" in the first column
    indices = df[df['Unnamed: 0'].str.contains('Group')].index.tolist()
    
    # creating empty dataframe for future data 
    masterDF = pd.DataFrame()  
    
    
    # looping initial dataframe
    for i in range(len(indices)):
        # slicing df from first indice to the last one
        # or, in other words,
        # reading the source excel file in chunks per each indice
        try:
            data = df.loc[indices[i]: indices[i + 1] - 1, :].copy()
        except IndexError:
            data = df.loc[indices[i]:, :].copy()
    
        # adding new Columns for each group
        data['Group'] = data.iloc[0,0]
        print(data.iloc[0,0])
        print(data)
        print('end_______________')
    
        # now appending each chunk of data to the main dataframe
        masterDF = pd.concat([masterDF, data])
    
    # cleaning up dataframe before spitting out to .csv
    masterDF.dropna(subset=['Sample', 'CD4', 'CD8'], how='all', inplace=True)  # dropping rows with no data
    masterDF.rename(columns={masterDF.columns[0]: ''}, inplace=True)
    
    
    # spitting out the final data
    masterDF.to_csv('output.csv', index=False)
    

    out:

               Sample    CD4   CD8   Group
    1   Day 1  8311.0  17.30  6.44  Group1
    2   Day 2  8312.0  13.60  3.50  Group1
    3   Day 3  8321.0  19.80  5.88  Group1
    5   Day 1  8322.0  13.50  4.09  Group2
    6   Day 2  8311.0  16.00  4.92  Group2
    7   Day 3  8312.0   5.67  2.28  Group2
    9   Day 1  8321.0  13.00  4.34  Group3
    10  Day 2  8322.0  10.60  1.95  Group3
    11  Day 3  8312.0   5.67  2.28  Group3