parallel-processing csh gnu-parallel

How to parallelize a csh while loop with GNU Parallel


I have the following script that creates multiple objects.

I tried simply running it in my terminal, but it takes a very long time. How can I run this with GNU Parallel?

The script below creates a shapefile. It loops iy from 1 to niy = 800, and for every value of iy it loops jx from 1 to njx = 675, adding one polygon per grid cell.

#!/bin/csh


set njx = 675 ### Number of grids in X
set niy = 800  ### Number of grids in Y
set ll_x = -337500 
set ll_y = -400000 ### (63 / 2) * 1000 ### This is the coordinate at lower left corner
set del_x = 1000
set del_y = 1000

rm -f out.shp
rm -f out.shx
rm -f out.dbf
rm -f out.prj


shpcreate out polygon    
dbfcreate out -n ID1 10 0 



@ n = 0 ### initialization of counter (n) to count grid cells in loop
@ iy = 1  ### initialization of counter (iy) to count grid cells along north-south direction

echo ### empty line on screen

while ($iy <= $niy)  ### start the loop for north-south direction
   echo ' south-north'   $iy '/' $niy ### print a notification on screen

   @ jx = 1 
   while ($jx <= $njx) ### start the loop for east-west direction 
      @ n++ 


      set x = `echo $ll_x $jx $del_x | awk '{print $1 + ($2 - 1) * $3}'`
      set y = `echo $ll_y $iy $del_y | awk '{print $1 + ($2 - 1) * $3}'`
      set txt = `echo $x $y $del_x $del_y | awk '{print $1, $2, $1, $2 + $4, $1 + $3, $2 + $4, $1 + $3, $2, $1, $2}'`

      shpadd out `echo $txt`
      dbfadd out $n

      @ jx++
   end ### close the second loop

   @ iy++
end ### close the first loop

echo 



### lines below create a projection file for the created shapefile using

cat > out.prj  << eof
PROJCS["Asia_Lambert_Conformal_Conic",GEOGCS["GCS_WGS_1984",DATUM["D_WGS_1984",SPHEROID["WGS_1984",6378137.0,298.257223563]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Lambert_Conformal_Conic"],PARAMETER["False_Easting",0.0],PARAMETER["False_Northing",0.0],PARAMETER["Central_Meridian",120.98],PARAMETER["Standard_Parallel_1",5.0],PARAMETER["Standard_Parallel_2",20.0],PARAMETER["Latitude_Of_Origin",14.59998],UNIT["Meter",1.0]]
eof

###
###
###



Solution

  • The inner part gets executed 540,000 times, and on each iteration you invoke 3 awk processes to do 3 simple bits of maths. That's about 1.6 million awk processes.
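
The counts above follow directly from the grid size in the question. A quick check of the arithmetic, using the question's njx and niy (the variable names `cells` and `awks` are only for illustration):

```shell
#!/bin/bash
njx=675   # grid cells in X, from the question
niy=800   # grid cells in Y, from the question

cells=$(( njx * niy ))   # inner-loop iterations
awks=$(( cells * 3 ))    # three awk invocations per iteration

echo "inner iterations: $cells"   # 540000
echo "awk processes:    $awks"    # 1620000
```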

    Rather than that, I have written a single awk to generate all the loops and do all the maths and this can then be fed into bash or csh to actually execute it.

    I wrote this and ran it completely in the time the original version got to 16% through. I have not checked it extremely thoroughly, but you should be able to readily correct any minor errors:

    #!/bin/bash
    
    # Generate every shapefile command from a single awk process
    awk -v njx=675 -v niy=800 -v ll_x=-337500 -v ll_y=-400000 -v del_x=1000 -v del_y=1000 '
       BEGIN{
          print "shpcreate out polygon"
          print "dbfcreate out -n ID1 10 0"
          n=0
    
          for(iy=1;iy<=niy;iy++){
             for(jx=1;jx<=njx;jx++){
                n++
                x = ll_x + (jx-1)*del_x   # lower-left corner of this cell
                y = ll_y + (iy-1)*del_y
                txt = sprintf("%d %d %d %d %d %d %d %d %d %d",x,y,x, y+del_y, x+del_x, y+del_y, x+del_x,y,x,y)
                print "shpadd out",txt
                print "dbfadd out",n
             }
          }
       }' /dev/null
    

    If the output looks good, you can then run it through bash or csh like this:

    ./MyAwk | csh
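
One possible way to judge whether the output looks good before handing it to a shell is to generate a tiny grid and inspect it. The sketch below is only an illustration: `gen` is a hypothetical helper that takes the grid size as arguments, and the 3 x 2 grid is an arbitrary choice. It prints the first few commands and compares the line count against the expected 2 header lines plus 2 lines per cell.

```shell
#!/bin/bash
# Hypothetical helper: emit the shapefile commands for an njx-by-niy grid.
gen() {
  awk -v njx="$1" -v niy="$2" -v ll_x=-337500 -v ll_y=-400000 \
      -v del_x=1000 -v del_y=1000 '
    BEGIN {
      print "shpcreate out polygon"
      print "dbfcreate out -n ID1 10 0"
      n = 0
      for (iy = 1; iy <= niy; iy++) {
        for (jx = 1; jx <= njx; jx++) {
          n++
          x = ll_x + (jx - 1) * del_x
          y = ll_y + (iy - 1) * del_y
          printf "shpadd out %d %d %d %d %d %d %d %d %d %d\n",
                 x, y, x, y + del_y, x + del_x, y + del_y, x + del_x, y, x, y
          print "dbfadd out", n
        }
      }
    }'
}

# Look at the first few generated commands for a 3 x 2 grid.
gen 3 2 | head -n 4

# Two header lines plus one shpadd and one dbfadd per cell.
expected=$(( 2 + 2 * 3 * 2 ))
actual=$(gen 3 2 | wc -l | tr -d ' ')
echo "expected $expected lines, got $actual"
```

The same count check scales to the full grid: 2 + 2 * 675 * 800 lines.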
    

    Note that I don't know anything about the Shapefile tools shpadd and dbfadd. They may or may not be safe to run in parallel; if they are anything like sqlite, running them in parallel will not help you much. I am guessing the changes above are enough to make a massive improvement to your runtime. If not, here are some other things you could think about.

    • You could append an ampersand (&) to each line that starts dbfadd or shpadd so that several start in parallel, and then print a wait after every 8 lines so that you run 8 things in parallel in chunks.

    • You could feed the output of the script directly into GNU Parallel, but I have no idea if the ordering of the lines is critical.

    • I presume this is creating some sort of database. It may be faster if you run it on a RAM-backed filesystem, such as /tmp if that is mounted as tmpfs on your system.

    • I notice there is a Python module for manipulating Shapefiles. I can't help thinking that would be many, many times faster still.
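
The ampersand-and-wait idea from the list above could be sketched as a small filter over the generated commands. This is an assumption-laden illustration, not a tested recipe: `batch` is a hypothetical helper, and whether shpadd and dbfadd can safely write to the same output files concurrently, or whether record order matters, is not established here. The filter leaves the create commands untouched so they still run first.

```shell
#!/bin/bash
# Hypothetical helper: background every shpadd/dbfadd line and
# emit a "wait" after each group of $1 such lines (default 8).
batch() {
  awk -v jobs="${1:-8}" '
    /^(shpadd|dbfadd) / {
      print $0 " &"
      if (++k % jobs == 0) print "wait"
      next
    }
    { print }                              # shpcreate/dbfcreate stay sequential
    END { if (k % jobs != 0) print "wait" }  # wait for a final partial group
  '
}

# Small demonstration on four sample command lines, in groups of 2.
result=$(printf '%s\n' \
  "shpcreate out polygon" \
  "shpadd out a" \
  "dbfadd out 1" \
  "shpadd out b" | batch 2)
echo "$result"
```

With the real generator this would be used as `./MyAwk | batch 8 | bash`. For the GNU Parallel route, note that `parallel` treats each input line as a command when no command is given, so something like `./MyAwk | tail -n +3 | parallel -j 8` (after running the two create commands once by hand) is one way to try it, with the same caveats about ordering and concurrent writes.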