
Python: how to split an irregularly delimited file?


I have an interesting problem caused by some newly formatted, poorly structured data files. They are comma-delimited text files that contain multiple sets of data, each with its own header. Originally I was using numpy's genfromtxt to read a single instance of data with one header. Now that there are multiple instances, genfromtxt just can't handle it. What would be the best way to split the file up and feed each individual instance into genfromtxt? Here is an example of the file. The data from the first instance butts directly up against the header of the second instance, and this repeats around 20 times per file. I have not yet found a common delimiter that separates them.

       0.8 9999.0 999.0 999.0 999.0 9999.0 9999.0 999.0 999.0 999.0 9999.000 999.000 999.0 999.0 99999.0  9.0  9.0  9.0  9.0  9.0  9.0
       0.5 9999.0 999.0 999.0 999.0 9999.0 9999.0 999.0 999.0 999.0   72.380  -7.761 999.0 999.0 99999.0  9.0  9.0  9.0  9.0  9.0  9.0
       0.3 9999.0 999.0 999.0 999.0 9999.0 9999.0 999.0 999.0 999.0 9999.000 999.000 999.0 999.0 99999.0  9.0  9.0  9.0  9.0  9.0  9.0
       0.0 9999.0 999.0 999.0 999.0 9999.0 9999.0 999.0 999.0 999.0   72.381  -7.760 999.0 999.0 99999.0  9.0  9.0  9.0  9.0  9.0  9.0
      -1.0  906.7  20.0  18.9  92.8  -10.1   -3.7  10.7  70.0 999.0   72.380  -7.761 999.0 999.0   953.8  1.0  1.0  1.0  1.0  1.0  9.0
    Data Type:                         AVAPS SOUNDING DATA, Channel 2/Descending
    Project ID:                        DYNAMO
    Release Site Type/Site ID:         NOAA P3/N43RF 20111116I1
    Release Location (lon,lat,alt):    072 12.04'E, 08 11.50'S, 72.201, -8.192, 966.4
    UTC Release Time (y,m,d,h,m,s):    2011, 11, 16, 04:22:07
    Reference Launch Data Source/Time: IWGADTS Format (IWG1)/04:22:07
    Sonde Id:                          110355308
    System Operator/Comments:          TMR/none, Good Drop
    Post Processing Comments:          Aspen Version 3.1; Created on 01 Feb 2012 23:18 UTC; Configuration research-dropsonde
    /
    /
    Nominal Release Time (y,m,d,h,m,s):2011, 11, 16, 04:22:07
     Time  Press  Temp  Dewpt  RH    Ucmp   Vcmp   spd   dir   Wcmp     Lon     Lat   Ele   Azi    Alt    Qp   Qt   Qrh  Qu   Qv   QdZ
      sec    mb     C     C     %     m/s    m/s   m/s   deg   m/s      deg     deg   deg   deg     m    code code code code code code
    ------ ------ ----- ----- ----- ------ ------ ----- ----- ----- -------- ------- ----- ----- ------- ---- ---- ---- ---- ---- ----
      89.8 1011.6  27.3  23.9  81.0 9999.0 9999.0 999.0 999.0 999.0 9999.000 999.000 999.0 999.0     0.0  1.0  1.0  1.0  9.0  9.0  9.0

Solution

  • You could adapt something like the code below. (It is written for Python 3; if you want to run it on Python 2.7, replace range() with xrange() for efficiency.)

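    def readSacredAttribute(holyInput):
        # Split "Name (a,b,c): v1, v2, hh:mm:ss" on ':' first, then on ','.
        # Splitting on ':' also breaks times such as 04:22:07 into h, m, s.
        raw = [ x.strip() for x in holyInput.readline().rstrip('\n').split(':') ]
        newRaw = []
        for i in range(len(raw) - 1):
            for x in [ x.strip() for x in raw[ i + 1 ].split(',') ]:
                newRaw.append(x)
    
        raw[ 1 : ] = newRaw
    
        # If the name carries a "(y,m,d,...)" list, pair each label with a value.
        # zip() stops at the shorter list; duplicate labels (the two 'm's)
        # overwrite each other in the dict.
        parameters = {}
        if '(' in raw[0]:
            base = raw[0].index('(') + 1
            to = raw[0].index(')')
            labels = [ x.strip() for x in raw[0][ base : to ].split(',') ]
            for label, value in zip(labels, raw[ 1 : ]):
                parameters[label] = value
    
        return (raw, parameters)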
    
    def splitThisStupidMess(holyInput):
        # Leading numeric rows (the sample above shows 5 before the attribute
        # lines; adjust the count to match your files).
        holyHeader = []
        for i in range(5):
            holyHeader.append([ float(x) for x in holyInput.readline().split()])
    
        # The nine "Name: value" attribute lines
        sacredAttributes = { x[0][0] : (x[0][1], x[1]) for x in [  readSacredAttribute(holyInput) for i in range(9) ] }
    
        # Ignore the '/' lines
        for i in range(2):
            holyInput.readline()
    
        nominalTime = readSacredAttribute(holyInput)
        sacredAttributes[nominalTime[0][0]] = (nominalTime[0][1], nominalTime[1])
    
        divineNames = holyInput.readline().split()
        divineUnits = holyInput.readline().split()
        holyInput.readline()    # Skip the dashed decoration line
        divineValues = [ float(x) for x in holyInput.readline().split() ]
    
        divineFooter = { divineNames[i] : (divineUnits[i], divineValues[i]) for i in range(len(divineNames)) }
    
        return (holyHeader, sacredAttributes, divineFooter)
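
If the number of rows per instance is not fixed, another option is to classify each line instead of counting lines. The sketch below is only one possible approach and rests on an assumption drawn from the sample in the question: a data row is a line whose whitespace-separated tokens all parse as floats, and anything else (attribute lines, '/', column names, units, the dashed rule) belongs to a header. The names is_data_row and split_soundings are made up for this example.

```python
def is_data_row(line):
    """Return True if every whitespace-separated token parses as a float."""
    tokens = line.split()
    if not tokens:
        return False
    try:
        for tok in tokens:
            float(tok)
    except ValueError:
        return False
    return True


def split_soundings(lines):
    """Yield (header_lines, data_lines) for each instance found in `lines`.

    Assumption: a non-numeric line that follows numeric rows marks the
    start of the next instance's header.
    """
    header, data = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if is_data_row(line):
            data.append(line)
        else:
            if data:
                # Header text after data rows: the previous instance is done.
                yield header, data
                header, data = [], []
            header.append(line)
    # Emit whatever is left at end of input.
    if header or data:
        yield header, data
```

You would call it on an open file, e.g. for header, data in split_soundings(open(path)), where path is your own file name. numpy.genfromtxt accepts a list of strings as its first argument, so each data list can be passed to it directly. If a file begins with data rows, as the excerpt in the question does, the first block simply comes back with an empty header.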