I have a measuring device that records data in .dat files like raw_data.dat in this folder, all with the same structure, and I want to be able to extract the last table in the file into a pandas data frame.
The file has a few tables and I am not sure if the tabulation structure here is standard for .dat files but I have tried to paste the text into excel and it recognises the text as separate table, so there is probably a fairly standard way to read the structure correctly into python. I couldn't find one so I've tried a really convoluted way, reading the .dat file into a string and manually chopping off the top of part of the file and saving the rest as .dat file. My hope is to then somehow be able to save the result as a .csv or .xls but I still can't find how to do that either. Furthermore after importing the tabulation is converted to \t's and this doesn't go back to tabulation in the saved files. My code for that is below
mylines = []
with open ('raw_file.dat', 'rt') as myfile:
for myline in myfile:
mylines.append(myline)
string = (mylines[8:])
with open("updated.dat", "w") as output:
output.write(str(string))
I must admit I am fairly new to python and I am not certain I use the functions correctly. Still, I hope there is a more straightforward way to go about it than the workaround I am attempting.
If you can be sure that the third table you want starts at the 8th line, then there's no reason you have to make it more complicated than just indexing the file from the 8th line up. From there, you can use string manipulation and list comprehension to clean your data:
import pandas as pd
# Read the data.
with open('raw_data.dat', 'r') as fh:
lines = fh.readlines()[8:]
# Remove newlines, tabs, and split each string separated by spaces.
clean = [line.strip.replace('\t', '').split() for line in lines]
# Feed the data into a DataFrame.
data = pd.DataFrame(clean[1:], columns=clean[0])
which outputs:
Time Variab1e1 ... v18 v19
0 +0.00000000e+000 +3.04142181e-002 ... +0.00000000e+000 +0.00000000e+000
1 +1.00000000e+000 +1.96144191e-001 ... +1.00000000e+000 +0.00000000e+000
2 +2.00000000e+000 +3.75599731e-001 ... +2.00000000e+000 +0.00000000e+000
If you want to convert the values into floats, you can do this before converting the data into a DataFrame:
headers = clean[0]
rows = [[float(value) for value in row] for row in clean[1:]]
data = pd.DataFrame(rows, columns=headers)
which gives you a much cleaner frame:
Time Variab1e1 Variable2 Variable3 Variable4 ... v15 v16 v17 v18 v19
0 0.0 0.030414 0.0 1.383808 0.0 ... 0.0 0.0 15.0 0.0 0.0
1 1.0 0.196144 1.0 7.660262 1.0 ... 0.0 1.0 15.0 1.0 0.0
2 2.0 0.375600 2.0 15.356726 2.0 ... 0.0 2.0 15.0 2.0 0.0