Search code examples
pythonpandasdataframeascii

ASC files not preserving empty columns when added to df Python


I have a load of ASC files to extract data from. The issue I am having is that some of the columns have empty rows where there is no data, when I load these files into a df - it populates the first columns with all the data and just adds nans to the end... like this:

a| b| c

1 | 2 | nan

when I want it to be:

a | b | c

1 |nan|2

(I can't figure out how to make a table here to save my life) but where there is no data I want it to preserve the space. Part of my code says the separator is any space with more than two white spaces so I can preserve the headers that have one space within them, I think this is causing the issue but I am not sure how to fix it. I've tried using astropy.io to open the files and determine the delimiter but I get the error that the number of columns doesn't match the data columns. here's an image of the general look of the files I have so you can see the lack of char delimiters and empty columns. enter image description here

starting_words = ['Core no.', 'Core No.','Core','Core  no.']       
data = []
file_paths = []

for file in filepaths:
with open(file) as f:
   
    for i, l in enumerate(f):
        if l.startswith(tuple(starting_words)):
            
            df = (pd.read_csv(file,sep = '\\s{2,}', engine = 'python', skiprows = i))
            file_paths.append((file.stem + file.suffix))
            df.insert(0,'Filepath', file)

            data += [df]
            break

this is the script that I've used to open the files and keep the header words together, I never got the astropy stuff to run - I either get the columns dont match error or it could not determine the file format.Also, this code has the skiprows part because the files all have random notes at the top that I don't want in my dataframe.


Solution

  • Your data looks well behaved, you could try to make use of the Pandas fwf to read the files with fixed-width formatted lines. If the inference from the fwf is not good enough for you, you can manually describe the extents of the fixed-width fields of each line using the parameter colspecs.

    Sample

    Core no.   Depth       Depth  Perm  Porosity  Saturations Oil
                  ft           m    mD         %                %
    
           1  5516.0    1681.277            40.0              1.0
           2  5527.0    1684.630            39.0              16.0
           3  5566.0    1696.517   508      37.0              4.0
              5571.0    1698.041   105      33.0              8.0
           6  5693.0    1735.226            44.0              16.0
              5702.0    1737.970   4320     35.0              31.0
           9  5686.0    1733.093   2420     33.0              26.0
    
    df = pd.read_fwf('sample.txt', skiprows=2, header=None)
    df.columns=['Core no.', 'Depth ft', 'Depth m' , 'Perm mD', 'Porosity%', 'Saturations Oil%']
    print(df)
    

    Output from df

       Core no.  Depth ft   Depth m  Perm mD  Porosity%  Saturations Oil%
    0       1.0    5516.0  1681.277      NaN       40.0               1.0
    1       2.0    5527.0  1684.630      NaN       39.0              16.0
    2       3.0    5566.0  1696.517    508.0       37.0               4.0
    3       NaN    5571.0  1698.041    105.0       33.0               8.0
    4       6.0    5693.0  1735.226      NaN       44.0              16.0
    5       NaN    5702.0  1737.970   4320.0       35.0              31.0
    6       9.0    5686.0  1733.093   2420.0       33.0              26.0