
Get specific files while keeping directory structure from HDFS


I have a directory structure looking like that on my HDFS system:

/some/path
  ├─ report01
  │    ├─ file01.csv
  │    ├─ file02.csv
  │    ├─ file03.csv
  │    └─ lot_of_other_non_csv_files
  ├─ report02
  │    ├─ file01.csv
  │    ├─ file02.csv
  │    ├─ file03.csv
  │    ├─ file04.csv
  │    └─ lot_of_other_non_csv_files
  └─ report03
       ├─ file01.csv
       ├─ file02.csv
       └─ lot_of_other_non_csv_files

I would like to copy to my local system all CSV files while keeping the directory structure.

I tried hdfs dfs -copyToLocal /some/path/report* but that method copies a lot of unnecessary (and quite large) files that I don't want to get.

I also tried hdfs dfs -copyToLocal /some/path/report*/file*.csv but this does not preserve the directory structure, and the copy fails on the files from report02 because files with the same names already exist locally.

Is there a way to get only files matching a specific pattern while still keeping the original directory structure?


Solution

  • As there doesn't seem to be any solution directly implemented in Hadoop, I finally ended up writing my own bash script:

    #!/bin/bash
    
    # extended-regex patterns of files to get
    TO_GET=("\.csv$" "\.png$")
    # extended-regex patterns of files/directories to avoid
    TO_AVOID=("_temporary")
    
    # function to join an array by a specified separator:
    # usage: join_arr ";" ${array[@]}
    join_arr() {
      local IFS="$1"
      shift
      echo "$*"
    }
    
    if (($# != 2))
    then
        echo "There should be two parameters (path of the directory to get and destination)."
    else
        # ensure that the provided path ends with a slash
        [[ "$1" != */ ]] && path="$1/" || path="$1"
        echo "Path to copy: $path"
        # ensure that the provided destination ends with a slash and append result directory name
        [[ "$2" != */ ]] && dest="$2/" || dest="$2"
        dest="$dest$(basename "$path")/"
        echo "Destination: $dest"
        # list files recursively, drop paths matching TO_AVOID, keep those matching TO_GET,
        # then extract the path column and strip the leading base path
        echo -n "Exploring path to find matching files... "
        readarray -t files < <(hdfs dfs -ls -R "$path" | egrep -v "$(join_arr "|" "${TO_AVOID[@]}")" | egrep "$(join_arr "|" "${TO_GET[@]}")" | awk '{print $NF}' | cut -c $((${#path}+1))-)
        echo "Done!"
        # check that at least one file was found
        ((${#files[@]} == 0)) && echo "No files matching the pattern."
        # get files one by one
        for file in "${files[@]}"
        do
            path_and_file="$path$file"
            dest_and_file="$dest$file"
            # make sure the directory exist on the local file system
            mkdir -p "$(dirname "$dest_and_file")"
            # get file in a separate process to be able to execute the queries in parallel
            (hdfs dfs -copyToLocal -f "$path_and_file" "$dest_and_file" && echo "$file") &
        done
        # wait for all queries to be finished
        wait
    fi
    

    You can call the script like this:

    $ script.sh "/some/hdfs/path/folder_to_get" "/some/local/path"
    

    The script will create a directory folder_to_get in /some/local/path with all CSV and PNG files, respecting the directory structure.
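
    For example, with the directory layout from the question and purely illustrative paths, a run such as script.sh "/some/path" "/tmp/local" should leave something like this on the local file system (only CSV files in this case, since the example tree contains no PNGs):

    $ find /tmp/local/path -type f
    /tmp/local/path/report01/file01.csv
    /tmp/local/path/report01/file02.csv
    /tmp/local/path/report01/file03.csv
    /tmp/local/path/report02/file01.csv
    /tmp/local/path/report02/file02.csv
    /tmp/local/path/report02/file03.csv
    /tmp/local/path/report02/file04.csv
    /tmp/local/path/report03/file01.csv
    /tmp/local/path/report03/file02.csv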

    Note: If you want to get files other than CSV and PNG, just modify the TO_GET variable at the top of the script. You can also modify the TO_AVOID variable to filter out directories that you don't want to scan even if they contain CSV or PNG files.
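
    For instance, to also fetch Parquet files and to skip staging directories in addition to the temporary ones (these extended-regex patterns are only an illustration, adapt them to your own layout), you could set:

    # illustrative patterns; adjust to the file types and directories you actually have
    TO_GET=("\.csv$" "\.png$" "\.parquet$")
    TO_AVOID=("_temporary" "\.staging")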