Tags: python, python-2.7, kml, zip

Python 2.7: search zip files for a .kml containing a string without unzipping


I am trying to write my first Python script, shown below. I want to search through a read-only archive on an HPC system and look inside the zip files it contains; these sit in folders alongside a variety of other folder and file types. If a zip contains a .kml file, I want to print the line in that file starting with the string <coordinates>.

import zipfile as z 
kfile = file('*.kml') #####breaks here#####
folderpath = '/neodc/sentinel1a/data/IW/L1_GRD/h/IPF_v2/2015/01/21'  # folder with multiple folders and .zips
for zipfile in folderpath:  # am only interested in the .kml files within the .zips
    if kfile in zipfile:
        with read(kfile) as k:
            for line in k:
                if '<coordinates>' in line:  # only want the coordinate line
                    print line  # print the coordinates
k.close()

Eventually I want to loop this through multiple folders rather than pointing to the exact folder location, i.e. loop through every subfolder in /neodc/sentinel1a/data/IW/L1_GRD/h/IPF_v2/2015/, but this is a starting point for me to try to understand how Python works.

I am sure there are many problems with this script before it will run but the current one I have is

kfile = file('*.kml')
IOError: [Errno 22] invalid mode ('r') or filename: '*.kml'
Process finished with exit code 1

Any help appreciated to get this simple process script working.


Solution

  • When you run:

    kfile = file('*.kml')
    

    You are trying to open a single file named exactly *.kml, which is not what you want. If you want to process all *.kml files, you will need to (a) get a list of matching files and then (b) process each file in that list.

    There are a number of ways to accomplish the above; the easiest is probably the glob module, which can be used something like this:

    import glob
    for kfilename in glob.glob('*.kml'):
        print kfilename
    

    However, if you are trying to process a directory tree, rather than a single directory, you may instead want to investigate the os.walk function. From the docs:

    Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).

    A simple example might look something like this:

    import os
    for root, dirs, files in os.walk('topdir/'):
        kfilenames = [fn for fn in files if fn.endswith('.kml')]
        for kfilename in kfilenames:
            print kfilename
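
    Since the .kml files in the question live inside zip archives, the same walk can be pointed at the .zip files and combined with the standard zipfile module, which can list an archive's members without extracting anything. The following is only a sketch under that assumption; the function name find_kml_in_zips is made up for illustration, and a corrupt archive would raise an error that is not handled here.

```python
import os
import zipfile


def find_kml_in_zips(topdir):
    """Yield (zip_path, member_name) for every .kml member of every zip under topdir."""
    for root, dirs, files in os.walk(topdir):
        for fn in files:
            if not fn.lower().endswith('.zip'):
                continue
            zip_path = os.path.join(root, fn)
            # ZipFile only reads the archive's index here; nothing is written to disk.
            with zipfile.ZipFile(zip_path) as zf:
                for member in zf.namelist():
                    if member.lower().endswith('.kml'):
                        yield zip_path, member
```

    Looping over find_kml_in_zips('topdir/') and printing each pair would then show which archives contain KML files.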
    

    Additional commentary

    Iterating over strings

    Your script has:

    for zipfile in folderpath:
    

    That will simply iterate over the characters in the string folderpath. E.g., the output of:

    folderpath = '/neodc/sentinel1a/data/IW/L1_GRD/h/IPF_v2/2015/01/21'
    for zipfile in folderpath:
        print zipfile
    

    Would be:

    /
    n
    e
    o
    d
    c
    /
    s
    e
    n
    t
    i
    n
    e
    l
    1
    a
    /
    

    ...and so forth.
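
    To iterate over what a folder contains rather than over the characters of its path, the directory has to be listed explicitly. A minimal sketch using os.listdir follows; the helper name list_zip_paths is hypothetical, and it only looks at the top level of the folder.

```python
import os


def list_zip_paths(folderpath):
    """Return the full paths of the .zip files directly inside folderpath, sorted."""
    paths = []
    for name in os.listdir(folderpath):
        full = os.path.join(folderpath, name)
        # Keep regular files only, so a directory named 'x.zip' is skipped.
        if name.lower().endswith('.zip') and os.path.isfile(full):
            paths.append(full)
    return sorted(paths)
```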

    read is not a context manager

    Your code has:

    with read(kfile) as k:
    

    There is no read built-in, and the .read method on files cannot be used as a context manager.
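
    What does work in a with statement is the object returned when something is opened: a file from open(), a zipfile.ZipFile, or an archive member from ZipFile.open(). Each is closed automatically when its block ends, so no separate close() call is needed. The sketch below assumes the member is UTF-8 text; the function name iter_member_lines is invented for this example.

```python
import zipfile


def iter_member_lines(zip_path, member):
    """Yield the decoded lines of one archive member, read straight from the zip."""
    with zipfile.ZipFile(zip_path) as zf:
        # zf.open() returns a file-like object that is itself a context manager.
        with zf.open(member) as k:
            for raw in k:
                yield raw.decode('utf-8')  # assumption: the member is UTF-8 text
```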

    KML is XML

    You're looking for "lines beginning with <coordinates>", but KML files are not line-based. An entire KML file could be written on a single line and it would still be valid.

    You are much better off using an XML parser to parse XML.
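
    As one possible approach, the standard library's xml.etree.ElementTree can pull out the text of every <coordinates> element regardless of how the file is laid out. This is a sketch rather than a complete KML reader: the function name extract_coordinates is hypothetical, and it matches on the local tag name so that it works whether or not the document declares an XML namespace.

```python
import xml.etree.ElementTree as ET


def extract_coordinates(kml_bytes):
    """Return the stripped text of every <coordinates> element in a KML document."""
    root = ET.fromstring(kml_bytes)
    results = []
    for elem in root.iter():
        # Namespaced tags look like '{namespace-uri}coordinates'; keep the local part.
        local_name = elem.tag.rsplit('}', 1)[-1]
        if local_name == 'coordinates' and elem.text:
            results.append(elem.text.strip())
    return results
```

    The bytes passed in could come from ZipFile.read() on a .kml member, which again avoids unpacking the archive.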