Bash: Loop through file and read substring as argument, execute multiple instances


How it is now

I currently have a script running under Windows that frequently fetches recursive file listings from a list of servers.

I use an AutoIt (job manager) script to execute 30 parallel instances of lftp (still on Windows), each doing this:

lftp -e "find .; exit" <serveraddr>

The file used as input for the job manager is a plain text file and each line is formatted like this:

<serveraddr>|...

where "..." is unimportant data. I need to run multiple instances of lftp in order to achieve maximum performance, because single instance performance is determined by the response time of the server.

Each lftp.exe instance pipes its output to a file named

<serveraddr>.txt

How it needs to be

Now I need to port this whole thing over to a dedicated Linux server (Ubuntu, with lftp installed). From my previous, very(!) limited experience with Linux, I guess this will be quite simple.

What do I need to write and with what? For example, do I still need a job manager script or can this be done in a single script? How do I read from the file (I guess this will be the easy part), and how do I keep a maximum of 30 instances running (maybe even with a timeout, because extremely unresponsive servers can clog the queue)?

Thanks!


Solution

  • Parallel processing

    I'd use GNU parallel. It isn't installed by default, but most Linux distributions offer it in their default package repositories. It works like this:

    parallel echo ::: arg1 arg2
    

    will execute echo arg1 and echo arg2 in parallel.

    So the easiest approach is to write a script that lists the files on one server, in bash, perl, python, or whatever suits your fancy, and execute it like this:

    parallel ./script ::: server1 server2

    The script could look like this:

    #!/bin/sh
    # $0 holds the program name, $1 holds the first argument.
    # GNU parallel passes the server address as $1; we save it to a variable.
    server="$1"
    lftp -e "find .; exit" "$server" >"$server-files.txt"
    

    lftp is available on Linux as well, so you don't need to change the FTP client.

    To run at most 30 instances at a time, pass -j30, like this: parallel -j30 echo ::: 1 2 3

    Reading the file list

    Now, how do you transform a specification file containing <server>|... entries into GNU parallel arguments? Easy: first, filter the file so that it contains just the host names:

    sed 's/|.*$//' server-list.txt
    

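    sed edits text line by line, here by substituting with a regular expression. This command deletes everything from the first | up to the line end: | matches the literal character, .* matches whatever follows it, and $ anchors the match at the end of the line. (While | normally means the alternation operator in regular expressions, in sed's default basic syntax it needs to be escaped, as \| in GNU sed, to work like that; otherwise it means just a plain |.)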
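To see the filter in action without touching the real list, here is a minimal sketch. The two entries and the temporary file are made up for illustration; only the sed expression comes from the answer.

```shell
# Hypothetical sample input; the real file would be server-list.txt.
list=$(mktemp)
printf '%s\n' 'ftp.example.com|foo|bar' 'ftp.example.org|baz' > "$list"

# Delete everything from the first | to the end of each line.
servers=$(sed 's/|.*$//' "$list")
printf '%s\n' "$servers"

rm -f "$list"
```

Because the match starts at the leftmost |, any further | characters in the trailing data are removed along with it.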

    So now you have a list of servers. How do you pass them to your script? With xargs! xargs reads whitespace-separated items from its input (here, one per line) and appends them as additional arguments to the command it runs. For example

    printf '1\n2\n' | xargs echo fixed_argument
    

    will run

    echo fixed_argument 1 2
    

    So in your case you should do

    sed 's/|.*$//' server-list.txt | xargs parallel -j30 ./script :::
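If installing GNU parallel is not an option, xargs itself can limit concurrency with its -P option (available in the GNU findutils version shipped with Ubuntu). This is a different technique from the one above, shown only as a rough sketch: run_all and its worker parameter are hypothetical names, and the worker defaults to echo so the pipeline can be tried without contacting any server.

```shell
# Sketch: run a worker once per server, at most 30 at a time, with xargs -P.
# run_all and its worker parameter are hypothetical names, not part of the
# original answer.
run_all() {
    list="$1"
    worker="${2:-echo}"
    # -n 1: one server per invocation; -P 30: up to 30 invocations at once.
    sed 's/|.*$//' "$list" | xargs -n 1 -P 30 "$worker"
}
```

Calling run_all server-list.txt ./script would then start ./script once per server. With GNU parallel installed, the xargs step can also be dropped entirely, since parallel reads its arguments from standard input when no ::: is given: sed 's/|.*$//' server-list.txt | parallel -j30 ./script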
    

    Caveats

    Be sure not to write the results of every parallel task to the same file, otherwise the file will get corrupted: plain shell redirection does no locking, so you would have to implement it yourself. That's why I redirected the output to $server-files.txt rather than files.txt.
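The question also asks about a timeout for unresponsive servers. One possible approach, sketched here under assumptions, is to wrap the lftp call in the timeout command from GNU coreutils. The function name list_server and the 300-second default for LIMIT are hypothetical choices, not part of the original script.

```shell
#!/bin/sh
# Sketch: list one server, giving up after LIMIT seconds.
# list_server and LIMIT are hypothetical names; 300 is an assumed value.
LIMIT="${LIMIT:-300}"

list_server() {
    server="$1"
    # timeout (GNU coreutils) terminates the command once LIMIT seconds have
    # passed and then returns exit status 124.
    timeout "$LIMIT" lftp -e "find .; exit" "$server" > "$server-files.txt"
}
```

Saved as ./script with list_server "$1" added as the last line, this plugs into the parallel call unchanged. GNU parallel also has a --timeout option of its own, which may be simpler if every job should share the same limit.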