I have a general question about multiply nested statements. For complicated nesting (more than three or four levels), what is a better approach, especially when iterating AND using if-statements?
I have a lot of files, some in sub-directories and others in the root directory. There are a number of directories from which I want to extract datasets and append them to a target dataset (the master).
for special_directory in directorylist:
    for dataset in special_directory:
        if dataset in list_of_wanted:
            some_code
            if it_already_exists:
                for feature_class in dataset:
                    if feature_class in list_of_wanted:
and then I really get into the meat of the code processing. Frankly, I can't think of a way to avoid these nested conditional and looping statements. Is there something I am missing? Should I be using "while" instead of "for"?
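For illustration, one way to cut down the indentation in the sketch above is to invert the conditions and continue early; this keeps the same loops and placeholder names, it just flattens the if-blocks:

for special_directory in directorylist:
    for dataset in special_directory:
        if dataset not in list_of_wanted:
            continue  # skip datasets we don't care about
        some_code
        if not it_already_exists:
            continue
        for feature_class in dataset:
            if feature_class not in list_of_wanted:
                continue
            # main processing happens here

The loops themselves don't go away, so this is a readability change rather than a performance one.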
My actual code works; it just doesn't move very quickly. It iterates over 27 databases and appends the contents of each to a new target database. My Python script has been running for 36 hours and is only through 4 of the 27. Tips?
I posted this on the GIS Stack Exchange, but my question is really too general to belong there: question and more specific code
Any tips? What are the best practices in this regard? This is already a subset of the code: it looks for datasets, and for feature classes within them, inside geodatabases from a list generated by another script. A third script handles feature classes stored directly in geodatabases (i.e. not within datasets).
import os
import arcpy
from arcpy import env

# gdblist (the source geodatabases) and gdb (the target geodatabase path) are built earlier in the full script
ds_wanted = ["Hydrography"]
fc_wanted = ["NHDArea", "NHDFlowline", "NHDLine", "NHDWaterbody"]

for item in gdblist:
    env.workspace = item
    for dsC in arcpy.ListDatasets():
        if dsC in ds_wanted:
            secondFD = os.path.join(gdb, dsC)
            if arcpy.Exists(secondFD):
                print("{} exists, not copying".format(secondFD))
                for fcC in arcpy.ListFeatureClasses(feature_dataset=dsC):
                    if fcC in fc_wanted:
                        secondFC2 = os.path.join(gdb, dsC, fcC)
                        if arcpy.Exists(secondFC2):
                            targetd2 = os.path.join(gdb, dsC, fcC)
                            # Create FieldMappings object and load the target dataset
                            print("Now begin field mapping!")
                            print("from {} to {}".format(item, gdb))
                            print("The target is " + targetd2)
                            fieldmappings = arcpy.FieldMappings()
                            fieldmappings.addTable(targetd2)
                            # Loop through each field in the input dataset
                            inputfields = [field.name for field in arcpy.ListFields(fcC) if not field.required]
                            for inputfield in inputfields:
                                # Iterate through each FieldMap in the FieldMappings
                                for i in range(fieldmappings.fieldCount):
                                    fieldmap = fieldmappings.getFieldMap(i)
                                    # If the field name from the target dataset matches a validated input field name
                                    if fieldmap.getInputFieldName(0) == inputfield.replace(" ", "_"):
                                        # Add the input field to the FieldMap and replace the old FieldMap with the new
                                        fieldmap.addInputField(fcC, inputfield)
                                        fieldmappings.replaceFieldMap(i, fieldmap)
                                        break
                            # Perform the Append
                            print("Appending stuff...")
                            arcpy.management.Append(fcC, targetd2, "NO_TEST", fieldmappings)
                        else:
                            arcpy.Copy_management(fcC, secondFC2)
                            print("Copied " + fcC + " into " + gdb)
                    else:
                        pass
            else:
                arcpy.Copy_management(dsC, secondFD)  # Copies the feature dataset from the source gdb to the target gdb
                print("Copied " + dsC + " into " + gdb)
        else:
            print("{} does not need to be copied to DGDB".format(dsC))

print("Done with datasets and the feature classes within them.")
It seems to really get caught on arcpy.management.Append. I have a fair amount of experience with this function, and even though this is a larger-than-typical table schema (more records, more fields), a single append is taking 12+ hours. To build on my original question: could this be because the code is so deeply nested? Or is that not the case, and the data simply requires that much time to process?
Some good comments in response to your question. I have limited experience with multiprocessing, but having all of your computer cores working will often speed things up. If you have a four-core processor that is only running at around 25% during script execution, then you can potentially benefit. You just need to be careful how you apply it, in case one thing always needs to happen before another. If you are working with file geodatabases rather than enterprise gdbs, then your bottleneck may be the disk. If the gdb is remote, network speed may be the issue. Either way, multiprocessing won't help. Resource Monitor on Windows will give you a general idea of how much processor/disk/RAM/network is being utilized.
I recently ran a similar script using rpy2 and data from/to PostGIS. It still took ~30 hours to run, but that is much better than 100. I haven't used multiprocessing yet in Arc (I mostly work in open source), but I know people who have.
A very simple implementation of multiprocessing:
from multiprocessing import Pool

def multi_run_wrapper(gdblist):
    """Helper function to unpack argument lists during multiprocessing.
    Modified from: http://stackoverflow.com/a/21130146/4062147"""
    return gdb_append(*gdblist)  # the * unpacks the list

def gdb_append(gdb_id):
    ...

# script starts here #

gdblist = [......]

if __name__ == '__main__':
    p = Pool()
    p.map(multi_run_wrapper, gdblist)
    print("Script Complete")
Normally you would join the results of the pools, but since you are using this to execute tasks I'm not sure this is necessary. Somebody else may be able to chime in as to what is best practice.
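If you do want the results back, Pool.map already returns them as a list, and close()/join() makes the main script wait for all the workers. A minimal sketch, assuming each worker simply returns an identifier and a status string (the gdb_append body and the paths here are placeholders, not the real append logic):

from multiprocessing import Pool

def gdb_append(gdb_path):
    # placeholder for the real copy/append work on one geodatabase
    return gdb_path, "done"

if __name__ == '__main__':
    gdblist = ["first.gdb", "second.gdb"]  # placeholder paths
    p = Pool()
    results = p.map(gdb_append, gdblist)  # blocks until every worker has finished
    p.close()
    p.join()
    for gdb_path, status in results:
        print("{}: {}".format(gdb_path, status))
    print("Script Complete")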