I am trying to write my first Python script, shown below. I want to search through a read-only archive on an HPC system and look inside the zip files it contains; these sit in folders alongside a variety of other folders and file types. If a zip contains a .kml file, I want to print the line in it that starts with the string <coordinates>.
import zipfile as z
kfile = file('*.kml') #####breaks here#####
folderpath = '/neodc/sentinel1a/data/IW/L1_GRD/h/IPF_v2/2015/01/21' # folder with multiple folders and .zips
for zipfile in folderpath: # am only interested in the .kml files within the .zips
    if kfile in zipfile:
        with read(kfile) as k:
            for line in k:
                if '<coordinates>' in line: # only want the coordinate line
                    print line # print the coordinates
            k.close()
Eventually I want to loop this over multiple folders rather than pointing to one exact folder location, i.e. loop through every subfolder in /neodc/sentinel1a/data/IW/L1_GRD/h/IPF_v2/2015/, but this is a starting point for me to try to understand how Python works.
I am sure there are many problems with this script before it will run, but the current one I have is:
kfile = file('*.kml')
IOError: [Errno 22] invalid mode ('r') or filename: '*.kml'
Process finished with exit code 1
Any help appreciated to get this simple process script working.
When you run:
kfile = file('*.kml')
You are trying to open a single file named exactly *.kml, which is not what you want. If you want to process all *.kml files, you will need to (a) get a list of matching files and then (b) process each file in that list.
There are a number of ways to accomplish the above; the easiest is probably the glob module, which can be used something like this:
import glob
for kfilename in glob.glob('*.kml'):
    print kfilename
However, if you are trying to process a directory tree, rather than a single directory, you may instead want to investigate the os.walk function. From the docs:
Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).
A simple example might look something like this:
import os
for root, dirs, files in os.walk('topdir/'):
    kfilenames = [fn for fn in files if fn.endswith('.kml')]
    for kfilename in kfilenames:
        print kfilename
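Since the .kml files in the question live inside .zip archives rather than directly on disk, the walk above can be combined with the standard zipfile module. The following is only a rough sketch, written to run on Python 3: the function name find_kml_in_zips is made up for illustration, and it assumes the archives carry a .zip extension and that the member names end in .kml.

```python
import os
import zipfile


def find_kml_in_zips(top):
    """Yield (zip_path, member_name) for each .kml member of each zip under top.

    Hypothetical helper for illustration; not part of any library.
    """
    for root, dirs, files in os.walk(top):
        for filename in files:
            # Assumption: archives of interest are named *.zip.
            if not filename.lower().endswith(".zip"):
                continue
            path = os.path.join(root, filename)
            # Skip files that have the extension but are not valid zips.
            if not zipfile.is_zipfile(path):
                continue
            # Opening the archive for reading does not modify it,
            # so this is safe on a read-only file system.
            with zipfile.ZipFile(path) as archive:
                for member_name in archive.namelist():
                    if member_name.lower().endswith(".kml"):
                        yield path, member_name
```

Calling list(find_kml_in_zips('/neodc/sentinel1a/data/IW/L1_GRD/h/IPF_v2/2015/')) would then collect one pair per .kml member found anywhere below that folder.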
Your script has:
for zipfile in folderpath:
That will simply iterate over the characters in the string folderpath. E.g., the output of:
folderpath = '/neodc/sentinel1a/data/IW/L1_GRD/h/IPF_v2/2015/01/21'
for zipfile in folderpath:
    print zipfile
Would be:
/
n
e
o
d
c
/
s
e
n
t
i
n
e
l
1
a
/
...and so forth.
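To loop over the entries of the folder instead of the characters of its name, the directory has to be listed explicitly, for example with os.listdir. A minimal sketch, assuming the archives end in .zip (the helper name list_zip_paths is hypothetical):

```python
import os


def list_zip_paths(folderpath):
    """Return the full paths of the .zip entries directly inside folderpath.

    Hypothetical helper for illustration; it does not descend into subfolders.
    """
    return sorted(
        os.path.join(folderpath, name)
        for name in os.listdir(folderpath)  # names of the entries, not characters
        if name.lower().endswith(".zip")
    )
```

A loop such as "for zip_path in list_zip_paths(folderpath):" then visits one archive per iteration.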
Your code has:
with read(kfile) as k:
There is no read built-in, and the .read method on files cannot be used as a context manager.
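What does work as a context manager is the object returned by zipfile.ZipFile, and likewise the file-like object returned by its open method. The sketch below is one possible way to read a member's lines without extracting the archive, written for Python 3; the name iter_member_lines is invented here, and decoding as UTF-8 is an assumption about the data.

```python
import zipfile


def iter_member_lines(zip_path, member_name):
    """Yield the text lines of one member of a zip archive.

    Hypothetical helper for illustration. The with blocks close both the
    member and the archive automatically, so no explicit close() is needed.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member_name) as member:
            for raw_line in member:
                # Members are read as bytes; UTF-8 is assumed here.
                yield raw_line.decode("utf-8", errors="replace")
```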
You're looking for "lines beginning with <coordinates>", but KML files are not line-based. An entire KML document could be a single line and it would still be valid.
You are much better off using an XML parser to parse XML.
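As a rough illustration of the parser approach, the standard library's xml.etree.ElementTree can pull out the text of every coordinates element regardless of how the document is split into lines. This is only a sketch: the function name extract_coordinates is made up, and matching on the local tag name (ignoring whatever XML namespace the file declares) is a simplifying assumption.

```python
import xml.etree.ElementTree as ET


def extract_coordinates(kml_data):
    """Return the text of every <coordinates> element in a KML document.

    Hypothetical helper for illustration. kml_data is the document contents,
    for example the result of reading a .kml member of a zip archive.
    """
    root = ET.fromstring(kml_data)
    results = []
    for element in root.iter():
        # Tags look like "{namespace-uri}coordinates"; keep only the local part.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "coordinates" and element.text:
            results.append(element.text.strip())
    return results
```

Because the parser works on the element tree, it does not matter whether the whole document sits on one line or the coordinates span several.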