Search code examples
pandasread-csv

Python pandas different separators for header and data


I wrote a nice script that parses into a pandas dataframe some monolitic output for easy plotting and manipulation.

I can read different sections of the lines and turn them into a bigger dataframe almost always

I can combine the following two:

  E*/MeV   224Pu     225Pu     226Pu     227Pu     228Pu     229Pu     230Pu  
    48.0 0.192E-11 0.106E-05 0.498E-05 0.354E-06 0.000E+00 0.000E+00 0.000E+00
...
...

and

  E*/MeV   224Pu     225Pu     226Pu     227Pu     228Pu     229Pu     230Pu  
    48.0 0.192E-11 0.106E-05 0.498E-05 0.354E-06 0.000E+00 0.000E+00 0.000E+00

...
...

by making them a dataframe with some crafty one liner pd.read_csv(filename, encoding='latin1', sep='\s+', skip_blank_lines=True, skiprows=lambda x: x not in rows_to_keep) once they are dataframes I can concatenate them.

HOWEVER, my problem arises when the headers suddenly also have one nasty space like bellow (notice the differce between 224Pu and 222 U). This last sections completely screws up my read_cvs.

  E*/MeV   222 U     223 U     224 U     225 U     226 U     227 U     228 U  
    48.0 0.722E-06 0.102E-03 0.306E-04 0.188E-06 0.671E-06 0.976E-06 0.977E-08
...
...

Is there any way where I can apply different "sep" settings for header and data?

Thank you in advance.

I tried different versions of sep, including sep=\s{2,} and it reads correctly the headers but now obviously it misses the data.

I do not see how adding more separators would help me as they are both spaces, or maybe I don't know enough about regexp.


Solution

  • You can handle headers and data separately:

    import re
    
    filename = 'data.csv'
    with open(filename) as fp:
        while True:
            row = fp.readline().strip()
            if row:
                headers = re.split(r'\s{2,}', row)
                break
        df = pd.read_csv(fp, sep='\s+', header=None, names=headers, ...)
    

    First case:

    >>> df
       E*/MeV         224Pu     225Pu     226Pu         227Pu  228Pu  229Pu  230Pu
    0    48.0  1.920000e-12  0.000001  0.000005  3.540000e-07    0.0    0.0    0.0
    

    Second case:

    >>> df
       E*/MeV         222 U     223 U     224 U         225 U         226 U         227 U         228 U
    0    48.0  7.220000e-07  0.000102  0.000031  1.880000e-07  6.710000e-07  9.760000e-07  9.770000e-09