Search code examples
pythonpandasdata-conversiondata-import

How can I read .dat file containing multiple tables into a pandas data frame?


I have a measuring device that records data in .dat files like raw_data.dat in this folder, all with the same structure, and I want to be able to extract the last table in the file into a pandas data frame.

The file has a few tables and I am not sure if the tabulation structure here is standard for .dat files but I have tried to paste the text into excel and it recognises the text as separate table, so there is probably a fairly standard way to read the structure correctly into python. I couldn't find one so I've tried a really convoluted way, reading the .dat file into a string and manually chopping off the top of part of the file and saving the rest as .dat file. My hope is to then somehow be able to save the result as a .csv or .xls but I still can't find how to do that either. Furthermore after importing the tabulation is converted to \t's and this doesn't go back to tabulation in the saved files. My code for that is below


mylines = []                             
with open ('raw_file.dat', 'rt') as myfile:
    for myline in myfile:
        mylines.append(myline)

string = (mylines[8:])

with open("updated.dat", "w") as output:
    output.write(str(string))

I must admit I am fairly new to python and I am not certain I use the functions correctly. Still, I hope there is a more straightforward way to go about it than the workaround I am attempting.


Solution

  • If you can be sure that the third table you want starts at the 8th line, then there's no reason you have to make it more complicated than just indexing the file from the 8th line up. From there, you can use string manipulation and list comprehension to clean your data:

    import pandas as pd
    
    # Read the data.
    with open('raw_data.dat', 'r') as fh:
        lines = fh.readlines()[8:]
    
    # Remove newlines, tabs, and split each string separated by spaces.
    clean = [line.strip.replace('\t', '').split() for line in lines]
    
    # Feed the data into a DataFrame.
    data = pd.DataFrame(clean[1:], columns=clean[0])
    

    which outputs:

                   Time         Variab1e1  ...               v18               v19
    0  +0.00000000e+000  +3.04142181e-002  ...  +0.00000000e+000  +0.00000000e+000
    1  +1.00000000e+000  +1.96144191e-001  ...  +1.00000000e+000  +0.00000000e+000
    2  +2.00000000e+000  +3.75599731e-001  ...  +2.00000000e+000  +0.00000000e+000
    

    If you want to convert the values into floats, you can do this before converting the data into a DataFrame:

    headers = clean[0]
    rows = [[float(value) for value in row] for row in clean[1:]]
    
    data = pd.DataFrame(rows, columns=headers)
    

    which gives you a much cleaner frame:

       Time  Variab1e1  Variable2  Variable3  Variable4  ...  v15  v16   v17  v18  v19
    0   0.0   0.030414        0.0   1.383808        0.0  ...  0.0  0.0  15.0  0.0  0.0
    1   1.0   0.196144        1.0   7.660262        1.0  ...  0.0  1.0  15.0  1.0  0.0
    2   2.0   0.375600        2.0  15.356726        2.0  ...  0.0  2.0  15.0  2.0  0.0