
Databricks subprocess vs os.system


I have the following shell command that I'm trying to run in Databricks:

find /dbfs/mnt/data/ -name somename.tar.tar -exec tar -xvzf {} -C /dbfs/mnt/raw/data \;

When I run it in a Databricks notebook, either as a shell command or through os.system as shown below, it works and extracts the files. Shell:

%sh    
find /dbfs/mnt/data/ -name somename.tar.tar -exec tar -xvzf {} -C /dbfs/mnt/raw/data \;

Python:

import os

cmd = ['find', '/dbfs/mnt/data/', '-name', 'somename.tar.tar', '-exec', 'tar', '-xvzf', '{}', '-C', '/dbfs/mnt/raw/data', '\\;']
cmd_join = " ".join(cmd)
os.system(cmd_join)

But running it with subprocess does not seem to do anything, even though the cell runs without raising an error.

subprocess.run(cmd)

Why is this the case?


Solution

  • When you pass a list to subprocess.run, no shell sits between you and the command, so you need to peel off any quotes or escapes that were only necessary for the shell.

    Specifically, the backslash before the ; is necessary under a shell because an unescaped semicolon is a command separator there. With a list there is no shell to strip the backslash, so find receives the literal two-character argument \; and does not recognize it as the terminator of -exec. Take the backslash out.

    cmd = [
        'find', '/dbfs/mnt/data/', '-name', 'somename.tar.tar',
        '-exec', 'tar', '-xvzf', '{}', '-C', '/dbfs/mnt/raw/data', ';']
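
One way to check what the shell would have handed to find is to tokenize the shell line with shlex.split, which follows POSIX shell quoting rules. This is only a minimal sketch for inspecting the arguments; it does not run find, and the variable names shell_line and argv are illustrative.

```python
import shlex

# The command exactly as it would be typed after %sh, escaped semicolon included.
shell_line = r"find /dbfs/mnt/data/ -name somename.tar.tar -exec tar -xvzf {} -C /dbfs/mnt/raw/data \;"

# shlex.split applies POSIX shell quoting rules, so the backslash is consumed
# and the last token is a bare ';', which is what find expects after -exec.
argv = shlex.split(shell_line)

print(argv[-1])  # prints ;
```

The resulting argv matches the corrected list above, so it could be passed to subprocess.run directly.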
    

    It is probably also worth adding check=True, so that a non-zero exit status from find raises an exception instead of passing silently:

    s = subprocess.run(cmd, check=True)
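
To illustrate why the original cell appeared to succeed, here is a small sketch of how subprocess.run behaves with and without check=True. It uses a child Python process as a stand-in for a failing command (an assumption for portability; on Databricks you would pass the find command instead), and the names failing_cmd, silent_returncode, raised and captured_stderr are hypothetical.

```python
import subprocess
import sys

# Stand-in for a failing command: a child Python process that writes a message
# to stderr and exits with status 1. (Placeholder for the real find command.)
failing_cmd = [sys.executable, "-c", "import sys; sys.exit('simulated failure')"]

# Without check=True the failure is silent: run() returns normally and the
# exit status is only visible on the returned CompletedProcess object.
result = subprocess.run(failing_cmd, capture_output=True, text=True)
silent_returncode = result.returncode

# With check=True a non-zero exit status raises CalledProcessError,
# which also carries the captured stderr of the child process.
try:
    subprocess.run(failing_cmd, capture_output=True, text=True, check=True)
    raised = False
except subprocess.CalledProcessError as exc:
    raised = True
    captured_stderr = exc.stderr

print(silent_returncode, raised)
```

Capturing stderr is optional, but it makes the error message from the failed command available for inspection in the notebook.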