Search code examples
pythonpython-2.7binaryfilesenumerate

How can I ignore binary data headers (or just read the serialized data) when enumerating a file with Python?


I have a file that has a number of headers of binary data (I suppose that is what it is) and after that, there are lines of text. I'm just starting to work with it, but I noticed that if I use the Python "enumerate" function it doesn't appear to read the lines I want it to read (I'm using Python 2.7.8). It is not returning the lines I'm interested in. In my text editor I can see the data I want but the result indicates maybe it is "serialized data"? There is more of the same binary at the end of the file.

Sample from Data File (I'm hoping to skip the first 8 lines): I want to start with the line that starts with "curve".

    ÿÿÿÿ          ENetDeedPlotter, Version=5.6.1.0, Culture=neutral, PublicKeyToken=null   QSystem.Drawing, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a   Net_Deed_Plotter.SerializeData!   LinesOfDataNumberOfTractsSeditortextSedLineStract
SNoteArraySNorthArrow
Slandscape
SPaperSizeSPaperBounds
SPrinterScaleSPrinterScaleStrSAllTractsMouseOffsetNSAllTractsMouseOffsetESAllTractsNOffsetSAllTractsEOffsetSImageScroll_YSImageScroll_XSImage_YSImage_XSImageFilePath
SUpDateMapSttcSttStbSboSnb
STitleText  SDateText   SPOBLines
SLabelCornersSNAmountTract0HasBeenMovedSEAmountTract0HasBeenMoved                      Net_Deed_Plotter.LineData[]   Net_Deed_Plotter.TractData[]   System.Collections.ArrayList+Net_Deed_Plotter.PaperForm+NorthArrowStruct   !System.Drawing.Printing.PaperSize   System.Drawing.Rectangle      '         Ân40.4635w 191.02
curve right radius 953.50 arc 361.84 chord n60.5705e 359.07
s56.3005e 3.81
s19.4515w 170.63
s13.4145w 60.67
s51.0250w 155.35
n40.4635w 191.02
curve left radius 615.16 arc 202.85 chord s45.19w 201.94

Sample Script

# INPUTS TO BE UPDATED
inputNDP = r"N:\Parcels\Parcels2012\57-11-115.ndp"
outputTXT = r"N:\Parcels\Parcels2012\57-11-115.txt"
# END OF INPUTS TO BE UPDATED
fileNDP = open(inputNDP, 'r')
for line in enumerate(9, fileNDP):
    print line

Result

(9, '\x00\x01\x00\x00\x00\xff\xff\xff\xff\x01\x00\x00\x00\x00\x00\x00\x00\x0c\x02\x00\x00\x00ENetDeedPlotter, Version=5.6.1.0, Culture=neutral, PublicKeyToken=null\x0c\x03\x00\x00\x00QSystem.Drawing, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a\x05\x01\x00\x00\x00\x1eNet_Deed_Plotter.SerializeData!\x00\x00\x00\x0bLinesOfData\x0eNumberOfTracts\x0bSeditortext\x07SedLine\x06Stract\n')
(10, 'SNoteArray\x0bSNorthArrow\n')
(11, 'Slandscape\n')
(12, 'SPaperSize\x0cSPaperBounds\rSPrinterScale\x10SPrinterScaleStr\x16SAllTractsMouseOffsetN\x16SAllTractsMouseOffsetE\x11SAllTractsNOffset\x11SAllTractsEOffset\x0eSImageScroll_Y\x0eSImageScroll_X\x08SImage_Y\x08SImage_X\x0eSImageFilePath\n')
(13, 'SUpDateMap\x04Sttc\x03Stt\x03Stb\x03Sbo\x03Snb\n')
(14, 'STitleText\tSDateText\tSPOBLines\rSLabelCorners')
>>> 

Solution

  • I figured out that since the file was in a binary format, I needed to read it in that way with open('myfile', 'rb') rather than with open('myfile', 'r') and I got a lot of help from this question.

    The re-write looks like this ...

    #ToDO write output file
    # INPUTS TO BE UPDATED
    inputNDP = r"N:\Parcels\Parcels2012\57-11-115.ndp"
    # END OF INPUTS TO BE UPDATED
    fileNDP = open(inputNDP, 'rb')
    def strip_nonascii(b):
        return b.decode('ascii', errors='ignore')
    
    n = 0
    for line in fileNDP:
        if n > 5:
            if '|' in line:
                break
            print(strip_nonascii(line)).strip('\n') # + str(n)
        n += 1