Search code examples
pythonpandasdatabaseparsingtext-parsing

How do i parse data file with similar structure to INI using python?


i have a file containing large data in the following format :

[metadata]      
metadata1   metadata2 metadata3
[data]      
label1   label2    label3
0   0   0
0   0   0
0   0   0
...

I am trying to load the data section to process in my python script as a pandas dataframe.

What i've tried so far is loading the file in excel and saving it as a sheet then loading in into a dataframe using pandas.read_excel() but this is not a practical solution as you can imagine.

I would appreciate help on this subject and thanks !


Solution

  • I would approach it following way: open file for reading, consume lines until [data] is found, give it to pandas.read_csv specyfing whitespaces as separator. Let file.txt content

    [metadata]      
    metadata1   metadata2 metadata3
    [data]      
    label1   label2    label3
    0   0   0
    0   0   0
    0   0   0
    

    then

    import pandas as pd
    f = open("file.txt", "r")
    while(next(f).strip()!='[data]'):
        pass
    df = pd.read_csv(f, sep='\s+')
    f.close()
    print(df)
    

    output

       label1  label2  label3
    0       0       0       0
    1       0       0       0
    2       0       0       0
    

    Explanation: tail of file might be considered as csv file with whitespaces as separator, by consuming lines up to [data] I am seeking to point so what follows is legal csv, then use pandas.read_csv function. Note that if number of lines before line with labels is known you might just exploit pandas.read_csv's skiprows.