
How to append more than 33 files in a gcloud bucket?


I usually append datasets in a gcloud bucket using:

gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite

However, today when I tried to append some data, the terminal printed the error CommandException: The compose command accepts at most 33 arguments.

I didn't know about this restriction. How can I append more than 33 files in my bucket? Is there another command-line tool? I would like to avoid creating a virtual machine for what looks like a rather simple task.

I checked the help with gsutil help compose, but it didn't help much. There is only a note saying "Note that there is a limit (currently 32) to the number of components that can be composed in a single operation." with no hint of a workaround.


Solution

  • Could you not do it recursively, in batches?

    I've not tried this.

    Given an arbitrary list of files (FILES)

    While there is more than 1 file in FILES:

    1. Take the first n where n<=32 from FILES (the 33-argument limit includes the destination) and gsutil compose them into a temp file
    2. If that succeeds, replace the n names in FILES with the 1 temp file.
    3. Repeat

    The file that remains is everything composed.

    Update

    The question piqued my curiosity and gave me an opportunity to improve my bash ;-)

    A rough-and-ready proof-of-concept bash script that generates batches of gsutil compose commands for arbitrary (limited by the string formatting %04) numbers of files.

    GSUTIL="gsutil compose"
    BATCH_SIZE="32"
    
    # These may be the same (or no) bucket
    # No trailing slash: the paths below add their own "/" separator
    SRC="gs://bucket01"
    DST="gs://bucket02"
    
    # Generate test LST
    FILES=()
    for N in $(seq -f "%04g" 1 100); do
        FILES+=("${SRC}/file-${N}")
    done
    
    function squish() {
      LST=("$@")
      LEN=${#LST[@]}
    
      if [ "${LEN}" -le "1" ]; then
        # Zero or one object; nothing to compose
        return 0
      fi
    
      # Only unique for this configuration; be careful
      COMPOSITE=$(printf '%s/composite-%04d' "${DST}" "${LEN}")
    
      if [ "${LEN}" -le "${BATCH_SIZE}" ]; then
        # Batch can be composed with one command
        echo "${GSUTIL} ${LST[@]} ${COMPOSITE}"
        return 0
      fi
    
      # Compose 1st batch of files
      # NB Provide start:size
      echo "${GSUTIL} ${LST[@]:0:${BATCH_SIZE}} ${COMPOSITE}"
    
      # Remove batch from LST
      # NB Provide start (to end is implied)
      REM=("${LST[@]:${BATCH_SIZE}}")
    
      # Prepend composite from above batch to the next run
      NXT=("${COMPOSITE}" "${REM[@]}")
    
      squish "${NXT[@]}"
    }
    
    squish "${FILES[@]}"
    

    Running with BATCH_SIZE=3, no buckets and 12 files yields:

    gsutil compose file-0001 file-0002 file-0003 composite-0012
    gsutil compose composite-0012 file-0004 file-0005 composite-0010
    gsutil compose composite-0010 file-0006 file-0007 composite-0008
    gsutil compose composite-0008 file-0008 file-0009 composite-0006
    gsutil compose composite-0006 file-0010 file-0011 composite-0004
    gsutil compose composite-0004 file-0012 composite-0002
    

    Note how composite-0012 is created by the first command and then fed into the next command as its first input.
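
    As a quick sanity check on that output: each chained command replaces up to BATCH_SIZE names with one, so the list shrinks by at most BATCH_SIZE - 1 per step. A minimal sketch of that arithmetic follows; compose_count is a hypothetical helper (not part of gsutil) and it assumes a batch size of at least 2.

```shell
# compose_count FILE_COUNT BATCH_SIZE
# Prints how many chained compose commands the approach above needs.
# Hypothetical helper; assumes BATCH_SIZE >= 2.
compose_count() {
  local files=$1 batch=$2

  if [ "${files}" -le 1 ]; then
    # Zero or one object needs no compose at all
    echo 0
    return 0
  fi

  # Ceiling of (files - 1) / (batch - 1) using integer arithmetic
  echo $(( (files - 1 + batch - 2) / (batch - 1) ))
}
```

    compose_count 12 3 prints 6, matching the six commands listed above, and compose_count 100 32 prints 4.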

    I'll leave it to you to improve throughput: instead of threading the output of each step into the next, chop the list into batches, run the gsutil compose commands for those batches in parallel, and then compose the resulting composites.
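
    One way that batched variant might look, as an untested-against-a-real-bucket sketch in the same spirit as the script above: it only prints the commands, round by round. The function name compose_rounds, the tmp-rNN-NNNN naming scheme and the bucket name are assumptions made for illustration. Commands within one round do not depend on each other, so they could be run in parallel; each round must finish before the next starts.

```shell
BATCH_SIZE=32            # per-command component limit quoted in the question
DST="gs://bucket02"      # hypothetical destination bucket

# compose_rounds OBJECT...
# Prints one "gsutil compose" command per batch, round by round.
compose_rounds() {
  local -a current=("$@")
  local -a next=() batch=()
  local round=1 i name

  while [ "${#current[@]}" -gt 1 ]; do
    next=()
    for ((i = 0; i < ${#current[@]}; i += BATCH_SIZE)); do
      # Slice out up to BATCH_SIZE names starting at offset i
      batch=("${current[@]:i:BATCH_SIZE}")

      if [ "${#batch[@]}" -eq 1 ]; then
        # A lone leftover needs no compose; carry it into the next round
        next+=("${batch[0]}")
        continue
      fi

      # Hypothetical naming scheme: round number plus batch start offset
      name=$(printf '%s/tmp-r%02d-%04d' "${DST}" "${round}" "${i}")
      echo "gsutil compose ${batch[*]} ${name}"
      next+=("${name}")
    done
    current=("${next[@]}")
    round=$((round + 1))
  done
}
```

    With the 100-entry FILES array from the script above, compose_rounds "${FILES[@]}" prints four commands for the first round and one for the second; the destination of the last command is the fully composed object. The intermediate objects stay in the bucket until you delete them.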