Search code examples
pandasdataframereformatting

Reformatting pandas table when column contains repeated headers


I have the pandas DataFrame below and I want to sort it such that the ["File Name", "File Start Time", etc] are column headers. I can imagine running a loop through the rows looking for strings, but perhaps there is a simpler option for this?

import pandas as pd

data = pd.read_csv(file_path + 'chb01-summary.txt',skiprows = 28, header=None, delimiter = ": ")

file source https://www.physionet.org/pn6/chbmit/chb01/chb01-summary.txt

enter image description here


Solution

  • You can use read_csv and reshape by unstack:

    url = 'https://www.physionet.org/pn6/chbmit/chb01/chb01-summary.txt'
    df = pd.read_csv(url, skiprows=28, sep=':\s+', names=['a','b'], engine='python')
    print (df.head())
                               a             b
    0                   File Name  chb01_01.edf
    1             File Start Time      11:42:54
    2               File End Time      12:42:54
    3  Number of Seizures in File             0
    4                   File Name  chb01_02.edf
    
    df = df.set_index([df['a'].eq('File Name').cumsum(), 'a'])['b']
           .unstack()
           .reset_index(drop=True)
    print (df.head())
    a File End Time     File Name File Start Time Number of Seizures in File  \
    0      12:42:54  chb01_01.edf        11:42:54                          0   
    1      13:42:57  chb01_02.edf        12:42:57                          0   
    2      14:43:04  chb01_03.edf        13:43:04                          1   
    3      15:43:12  chb01_04.edf        14:43:12                          1   
    4      16:43:19  chb01_05.edf        15:43:19                          0   
    
    a Seizure End Time Seizure Start Time  
    0             None               None  
    1             None               None  
    2     3036 seconds       2996 seconds  
    3     1494 seconds       1467 seconds  
    4             None               None