Search code examples
pythonpandasdata-cleaning

How to clean up extra heading info in middle of csv with pandas


I have a csv file that I am trying to convert into a data frame. But the data has some extra heading material that gets repeated. For example:

Results Generated Date Time  
Sampling Info  
Time; Data  
1; 4.0  
2; 5.2  
3; 6.1  

Results Generated Date Time  
Sampling Info   
Time; Data  
6; 3.2   
7; 4.1   
8; 9.7    

If it is a clean csv file without the extra heading material, I am using

df = pd.read_csv(r'Filelocation', sep=';', skiprows=2)  

But I can't figure out how to remove the second set of header info. I don't want to lose the data below the second header set. Is there a way to remove it so the data is clean? The second header set is not always in the same location (basically a data acquisition mistake).
Thank you!


Solution

  • import pandas as pd
    
    filename = 'filename.csv'
    lines =open(filename).read().split('\n')   # reading the csv file
    
    list_ = [e for e in lines if e!='' ]  # removing '' characters from lines list
    
    list_ = [e for e in list_ if e[0].isdigit()]  # removing string starting with non-numeric characters 
    
    Time = [float(i.split(';')[0]) for i in list_]   # use int or float depending upon the requirements
    
    Data = [float(i.split(';')[1].strip()) for i in list_]
    
    
    df = pd.DataFrame({'Time':Time, 'Data':Data})    #making the dataframe 
    df
    

    I hope this will do the work !