Search code examples
python-3.xpandasread-csv

Pandas text file import: automate selection of 1 table when many exist within each file


Using Pandas 2.2.1 within Python 3.12, I'm importing data from text files that have a varying number of rows of metadata and multiple tables that precede the actual data to be imported. The number of rows that need to be skipped is typically between 3500 to 4500 when using the "skiprows" option within the "pd.read_csv" approach. The rows of data that are needed are always located at the very end of the text file and are bracketed by a row of dashes. The file structure (overly simplified) looks like this:

Archive generated by archiveSST02e.f                                                         
GEOPHYSICAL LOG
Borehole fluid type:                                           WATER            
Borehole fluid resistivity/conductivity:                       see log          
Borehole fluid temperature:                                    see log                                                                               
CTD manufacturer and SN:                                       Sea and Sun Tech.   

***Thousands of rows of more text and tables of unused data***

Then the data that is actually needed.
The last row of dashes is the last row of the text file.
--------------------------------------------------------------------------------------        
                                         Spec.   Spec.                                                    
        Corr.                            cond.   cond.                                                       
Alt.     sub.     Sub.   Cond.   Temp.   CTD     formula    Press.     Dens.                                 
feet     feet     feet   uS/cm   deg C   uS/cm   uS/cm      dbar      kg/m^3        
--------------------------------------------------------------------------------------
0.38     0.53     0.13    1067   26.870   1028    1030       0.04     996.93 
0.24     0.67     0.27    1048   26.794   1012    1014       0.08     996.95
0.12     0.79     0.38    1014   26.762    980     981       0.11     996.95
-0.34     1.25     0.84     842   26.785    813     814       0.25     996.88
...
-0.34     1.25     0.84     825   26.774    797     798       0.25     996.87
--------------------------------------------------------------------------------------

Example script used to import the text file data using the following python script:

df = pd.read_csv('Name_of_file.txt',encoding='utf-8', delimiter=r"\s+", skiprows=4196, header=None)

I can't seem to find anything on Stack Overflow that will allow me to import the desired section of data without hard coding how many header lines to skip. Each text file I need to import has different number of rows to skip and different number or rows of data to import.


Solution

  • You could write a script to extract the data into a separate file. Take a look at readlines():

    with open('Name_of_file.txt', 'r', encoding='utf-8') as file:
            lines = file.readlines()
    

    Now you have a list of all lines in the file. Get a list of the indices of the lines that are dashes: (using https://www.w3schools.com/python/ref_string_startswith.asp)

    dash_lines = [i for i, line in enumerate(lines) if line.startswith('-')]
    

    Now you can easily get the index where your data starts:

    start_index = dash_lines[-2] + 1  # one after second-to-last dashed line
    data_lines = lines[start_index:-1] #don't include last line, which is also dashed