python, for-loop, subprocess

How to call `subprocess` efficiently (and avoid calling it in a loop)


I have a Python script with a for-loop that iterates through a list of items. I need to perform a computation on a property of each item, but the code that does this computation is in Java (where the Java main() method accepts two arguments: arg1 and arg2 in the example below). So far, so good -- I can use subprocess to call Java.

This is how I do it currently (simplified):

from subprocess import Popen, PIPE

cp = ... # my classpath string
java_file = ... # the file with the java code
arg1 = ... # an argument string (always the same value) 
items = [...] # my list of items
for item in items:
    arg2 = ... # calculated from item inside the python script
    cmd = ['java', '-cp', cp, java_file, arg1, arg2]
    process = Popen(cmd, stdout=PIPE, stderr=PIPE) # no shell=True: the command is already an argument list
    output, errors = process.communicate()
    outp_str = output.decode('utf-8') # the result I need

It works, but because my list can contain thousands of elements, I'd be calling subprocess as many times -- which seems very inefficient.

Is there a way in which I can call subprocess only once, before the loop, and then give the active subprocess the necessary command within the loop? Or would that make no sense in terms of speed/efficiency?

I found this question, which seems to be related -- but I can't manage to translate this to my scenario. I also did not find my solution in the docs for subprocess. I imagine it would be something like this:

cp = ... # my classpath string
java_file = ... # the file with the java code
arg1 = ... # an argument string (always the same value)
cmd = [...] # <-- ???
process = Popen(cmd, stdin=PIPE, stdout=PIPE, stderr=PIPE) # stdin=PIPE is needed, otherwise process.stdin is None
items = [...] # my list of items
for item in items:
    arg2 = ... # calculated from item inside the python script
    process.stdin.write(bytes(..., 'utf-8')) # <-- ???
    process.stdin.flush()
    result = process.stdout.readline() # the result I need 

... where I can't figure out what the two commands should be (in the lines that have the question marks).

Is what I want possible? Any help much appreciated!


Solution

  • Whether you can make your Python code more "efficient" depends on how the Java application is implemented. If the Java application can only receive input via command line arguments, then there's nothing you can do on the Python side: you'll have to launch a new subprocess for every pair of arguments. But if the Java application is implemented to read from its standard input, then you can launch it once and write to its stdin from the Python side. What that looks like depends on the protocol defined by the Java application.

    You also ask what the Java command would look like if you can write to standard input. The command to launch the Java application is the same. What may be different is what command line arguments, if any, you need to pass to the Java application. And that again depends on how the Java application is implemented.

    Note that reusing the same subprocess is cheaper than launching a new one for each pair of arguments, especially given Java's relatively high startup time. But whether you can do this depends on the Java application.

    If you control the Java application, you can modify it to read its standard input. You may also want to consider simply having the Java application accept more than two command line arguments (e.g., if it receives 40 command line arguments, it processes them as 20 pairs).
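    As a rough illustration of the "more than two command line arguments" idea, the sketch below builds one command per chunk of pairs, so that a few launches cover many pairs. It assumes a hypothetical Java application that accepts any even number of arguments (the sample application further down does not), and the names chunked_commands and pairs_per_call are made up for this example. Chunking also keeps each command below the operating system's command line length limit.

```python
def chunked_commands(base_cmd, pairs, pairs_per_call):
    """Build one command line per chunk of argument pairs.

    base_cmd is the fixed part, e.g. ['java', '-jar', 'app.jar'] (hypothetical).
    """
    if pairs_per_call < 1:
        raise ValueError('pairs_per_call must be at least 1')
    commands = []
    for start in range(0, len(pairs), pairs_per_call):
        chunk = pairs[start:start + pairs_per_call]
        # Flatten [(a, b), (c, d)] into [a, b, c, d] for the command line.
        flat_args = [arg for pair in chunk for arg in pair]
        commands.append(list(base_cmd) + flat_args)
    return commands
```

    Each resulting command can then be passed to subprocess.run, which gives len(pairs) / pairs_per_call launches (rounded up) instead of one launch per pair.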


    Here's an example Java application that can switch between "processing" two command line arguments and "processing" an unknown number of argument pairs from standard input.

    package sample;
    
    import java.nio.charset.StandardCharsets;
    import java.util.Scanner;
    
    public class Main {
    
        public static void main(String[] args) {
            if (args.length == 1 && args[0].equals("--use-stdin")) {
                processArgsFromStandardInput();
            } else if (args.length == 2) {
                processArgs(args[0], args[1]);
            } else {
                System.err.println("Illegal command line. Must be --use-stdin or 2 arguments.");
                System.exit(1);
            }
        }
        
        static void processArgsFromStandardInput() {
            Scanner scanner = new Scanner(System.in, StandardCharsets.UTF_8);
            scanner.useDelimiter(",");
            
            while (scanner.hasNext()) {
                String arg1 = scanner.next();
                String arg2 = scanner.next();
                processArgs(arg1, arg2);
            }
        }
        
        static void processArgs(String arg1, String arg2) {
            System.out.printf("Processing args: %s, %s%n", arg1, arg2);
        }
    }
    

    I chose to use a Scanner, but you can use whatever you want (e.g., BufferedReader, DataInputStream, etc.). The important part is that the source of data is System.in (standard input). I also chose to use "," as the delimiter between arguments. Again, that was an arbitrary choice, and you can use whatever you want. It does mean the arguments can't contain commas themselves (I don't provide a way to "escape" a comma). Note that UTF-8 encoding plus commas as the delimiter is the "protocol" I mentioned earlier.
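    On the Python side, that protocol could be captured in a small helper. This is only a sketch: encode_pairs is a made-up name, and rejecting arguments that contain the delimiter is one possible way to handle the missing escape mechanism, not something the Java code requires.

```python
def encode_pairs(pairs, delimiter=','):
    """Serialize argument pairs as UTF-8 bytes, each argument followed by the delimiter."""
    chunks = []
    for arg1, arg2 in pairs:
        for arg in (arg1, arg2):
            # The protocol has no escaping, so an embedded delimiter would
            # silently shift every later argument by one position.
            if delimiter in arg:
                raise ValueError(f'argument {arg!r} contains the delimiter {delimiter!r}')
            chunks.append(arg + delimiter)
    return ''.join(chunks).encode('utf-8')
```

    For example, encode_pairs([('arg1', 'arg2')]) produces b'arg1,arg2,', which is the same byte sequence the script below writes for one pair.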

    And here's an example Python script that runs the Java application (compiled and packaged into a JAR file) in each "mode": once per argument pair via command line arguments, then a single time via standard input:

    import subprocess
    import sys
    from subprocess import Popen, PIPE
    from time import time
    
    
    def measure_time(func):
        def wrapper(*args):
            start = time()
            func(*args)
            end = time()
            print(f'Function took {end - start:.2f} seconds.')
        return wrapper
    
    
    # implementation from https://stackoverflow.com/a/5389547/6395627
    def pairwise(iterable):
        a = iter(iterable)
        return zip(a, a)
    
    
    @measure_time
    def invoke_args(jarfile, args):
        for arg1, arg2 in pairwise(args):
            subprocess.run(['java', '-jar', jarfile, arg1, arg2])
    
    
    @measure_time
    def invoke_stdin(jarfile, args):
        with Popen(['java', '-jar', jarfile, '--use-stdin'], stdin=PIPE) as proc:
            for arg1, arg2 in pairwise(args):
                proc.stdin.write(f'{arg1},{arg2},'.encode())
    
    
    if __name__ == '__main__':
        jarfile = sys.argv[1]
        args = [f'arg{i}' for i in range(1, 41)]
    
        print('========== COMMAND LINE ARGS ==========')
        invoke_args(jarfile, args)
        print()
    
        print('========= STANDARD INPUT ==========')
        invoke_stdin(jarfile, args)
        print()
    

    If you invoke the above Python script (passing an appropriate JAR file path and assuming you have Java on your path), then you should see output similar to:

    ========== COMMAND LINE ARGS ==========
    Processing args: arg1, arg2
    Processing args: arg3, arg4
    Processing args: arg5, arg6
    Processing args: arg7, arg8
    Processing args: arg9, arg10
    Processing args: arg11, arg12
    Processing args: arg13, arg14
    Processing args: arg15, arg16
    Processing args: arg17, arg18
    Processing args: arg19, arg20
    Processing args: arg21, arg22
    Processing args: arg23, arg24
    Processing args: arg25, arg26
    Processing args: arg27, arg28
    Processing args: arg29, arg30
    Processing args: arg31, arg32
    Processing args: arg33, arg34
    Processing args: arg35, arg36
    Processing args: arg37, arg38
    Processing args: arg39, arg40
    Function took 2.46 seconds.
    
    ========= STANDARD INPUT ==========
    Processing args: arg1, arg2
    Processing args: arg3, arg4
    Processing args: arg5, arg6
    Processing args: arg7, arg8
    Processing args: arg9, arg10
    Processing args: arg11, arg12
    Processing args: arg13, arg14
    Processing args: arg15, arg16
    Processing args: arg17, arg18
    Processing args: arg19, arg20
    Processing args: arg21, arg22
    Processing args: arg23, arg24
    Processing args: arg25, arg26
    Processing args: arg27, arg28
    Processing args: arg29, arg30
    Processing args: arg31, arg32
    Processing args: arg33, arg34
    Processing args: arg35, arg36
    Processing args: arg37, arg38
    Processing args: arg39, arg40
    Function took 0.16 seconds.
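
    The script above lets the Java output go straight to the terminal. The question also asks for each result back in Python, which needs stdout=PIPE and a protocol where every request gets exactly one reply line. Below is a minimal sketch of that round trip. To keep it runnable without a JAR file, the child process is a small Python stand-in (CHILD_SOURCE, a hypothetical substitute for the Java application) that reads one "arg1,arg2" line and prints one result line. A real Java application would have to follow the same line-per-request convention and flush its output after each result, which is an assumption here and not how the sample Main class above behaves.

```python
import subprocess
import sys

# Hypothetical stand-in for the Java application: one reply line per input line.
CHILD_SOURCE = r'''
import sys
for line in sys.stdin:
    arg1, arg2 = line.rstrip("\n").split(",")
    print(f"Processing args: {arg1}, {arg2}", flush=True)
'''


def run_pairs(cmd, pairs):
    """Send each pair to one long-lived child process and collect one reply per pair."""
    results = []
    with subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                          text=True, encoding='utf-8') as proc:
        for arg1, arg2 in pairs:
            proc.stdin.write(f'{arg1},{arg2}\n')
            proc.stdin.flush()  # make sure the child sees the request now
            results.append(proc.stdout.readline().rstrip('\n'))
    # Leaving the with-block closes the pipes and waits for the child to exit.
    return results


if __name__ == '__main__':
    pairs = [('arg1', 'arg2'), ('arg3', 'arg4')]
    for result in run_pairs([sys.executable, '-c', CHILD_SOURCE], pairs):
        print(result)
```

    Writing one request and then reading one reply keeps both sides in step. If the child buffered its output or answered with more or fewer lines than requests, readline() could block indefinitely, so the flush on both sides matters.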