Using pandas 2.2.1 with Python 3.12, I'm importing data from text files that have a varying number of rows of metadata and multiple tables preceding the actual data to be imported. The number of rows that need to be skipped is typically between 3500 and 4500 when using the "skiprows" option of "pd.read_csv". The rows of data that are needed are always located at the very end of the text file and are bracketed by rows of dashes. The file structure (overly simplified) looks like this:
Archive generated by archiveSST02e.f
GEOPHYSICAL LOG
Borehole fluid type: WATER
Borehole fluid resistivity/conductivity: see log
Borehole fluid temperature: see log
CTD manufacturer and SN: Sea and Sun Tech.
***Thousands of rows of more text and tables of unused data***
Then the data that is actually needed.
The last row of dashes is the last row of the text file.
--------------------------------------------------------------------------------------
                                                 Spec.    Spec.
             Corr.                               cond.    cond.
     Alt.     sub.     Sub.    Cond.    Temp.      CTD  formula   Press.    Dens.
     feet     feet     feet    uS/cm    deg C    uS/cm    uS/cm     dbar   kg/m^3
--------------------------------------------------------------------------------------
     0.38     0.53     0.13     1067   26.870     1028     1030     0.04   996.93
     0.24     0.67     0.27     1048   26.794     1012     1014     0.08   996.95
     0.12     0.79     0.38     1014   26.762      980      981     0.11   996.95
    -0.34     1.25     0.84      842   26.785      813      814     0.25   996.88
...
    -0.34     1.25     0.84      825   26.774      797      798     0.25   996.87
--------------------------------------------------------------------------------------
Example of the Python script currently used to import the text file data:
df = pd.read_csv('Name_of_file.txt', encoding='utf-8', delimiter=r"\s+", skiprows=4196, header=None)
I can't seem to find anything on Stack Overflow that will allow me to import the desired section of data without hard-coding how many header lines to skip. Each text file I need to import has a different number of rows to skip and a different number of rows of data to import.
You could write a few lines of code to locate and extract the data yourself instead of hard-coding skiprows. Take a look at readlines():
with open('Name_of_file.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()
Now you have a list of all lines in the file. Get the indices of the lines that start with dashes, using str.startswith() (https://www.w3schools.com/python/ref_string_startswith.asp):
dash_lines = [i for i, line in enumerate(lines) if line.startswith('-')]
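If a file might ever not match this layout, you could add a quick sanity check before indexing into dash_lines (optional, just a defensive sketch):

if len(dash_lines) < 2:
    raise ValueError('Expected at least two rows of dashes in the file')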
Now you can easily get the index where your data starts:
start_index = dash_lines[-2] + 1                # one after the second-to-last dashed line
data_lines = lines[start_index:dash_lines[-1]]  # stop before the last dashed line
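From there you can hand the extracted lines straight to pandas through an in-memory buffer, without writing them out to a separate file first. This is just a sketch that assumes the columns are whitespace-separated, as in your existing pd.read_csv call; the column names at the end are made-up placeholders based on the header block in your sample, so rename them however you like:

import io
import pandas as pd

# Rejoin the extracted lines (readlines() keeps the trailing newlines) and parse them
data_str = ''.join(data_lines)
df = pd.read_csv(io.StringIO(data_str), sep=r"\s+", header=None)

# Optional: hypothetical column names matching the 9 columns in the sample file
df.columns = ['alt_ft', 'corr_sub_ft', 'sub_ft', 'cond_uScm', 'temp_degC',
              'spec_cond_ctd_uScm', 'spec_cond_formula_uScm', 'press_dbar', 'dens_kgm3']

Because the start and end of the slice are computed from the dashed lines themselves, this should work no matter how many metadata rows precede the data, so no hard-coded skiprows is needed. Alternatively, you could keep your original pd.read_csv call on the file itself and compute its arguments from the same indices, e.g. skiprows=dash_lines[-2] + 1 and nrows=dash_lines[-1] - dash_lines[-2] - 1.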