Tags: shell, apache-spark, unix, hadoop2

Why does this unzip shell script behave differently when the environment changes from Dev to Prod?


# Source directory with the .gz files and destination for the unzipped files
output_path=s3://output
unziped_dir=s3://2019-01-03

# Build a space-separated list of all .gz files in the source directory
files=`hadoop fs -ls $output_path/ | awk '{print $NF}' | grep .gz$ | tr '\n' ' '`
for f in $files
do
    echo "available files are: $f"
    # Extract the bare file name (the last path component)
    filename=$(hadoop fs -ls $f | awk -F '/' '{print $NF}' | head -1)
    # Decompress on the fly and store the result without the .gz suffix
    hdfs dfs -cat $f | gzip -d | hdfs dfs -put - $unziped_dir"/"${filename%.*}
    echo "unziped file names: ${filename%.*}"
done

Output:

Dev:

available files are: s3://2019-01-03/File_2019-01-03.CSV.gz
unziped file names: File_2019-01-03.CSV
available files are: s3://2019-01-03/Data_2019-01-03.CSV.gz
unziped file names: Data_2019-01-03.CSV
available files are: s3://2019-01-03/Output_2019-01-03.CSV.gz
unziped file names: Output_2019-01-03.CSV

Prod:

available files are: s3://2019-01-03/File_2019-01-03.CSV.gz s3://2019-01-03/Data_2019-01-03.CSV.gz s3://2019-01-03/Output_2019-01-03.CSV.gz 
unziped file names: 

I am trying to look into a directory, identify the .gz files, iterate over them to unzip all the .gz files, and store the results in a different directory. When I run this script on the dev EMR cluster, it works fine, but on the prod cluster it does not. The behavior of the script is shown above.


Solution

  • There seems to be a problem with the word splitting in for f in $files. Normally the shell splits the value of $files at the spaces, as it does on Dev: there, f is set to one of the three words from $files on each iteration of the for loop. On Prod, f gets the complete value of $files, including the spaces.

    Do you set the variable IFS somewhere?
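
    A minimal sketch (my own, not from the original script) of how a changed IFS would reproduce the Prod behavior: with IFS set to a newline only, the shell no longer splits $files at the spaces.

    # Hypothetical reproduction: an IFS without a space disables splitting at spaces
    files="foo bar baz"
    IFS=$'\n'
    for f in $files
    do
        echo "available files are: $f"   # prints all three words on one line, as on Prod
    done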

    If the problem is not in other parts of your script, you should be able to reproduce it with a reduced script:

    files="foo bar baz"
    for f in $files
    do   
      echo "available files are: $f"
    done
    

    If this minimal script doesn't show a difference, the problem is in other parts of your script.

    To see if the value of IFS is different on Dev and Prod, you can add the following to the minimal script, or to your original script just before the for loop:

    # To see if IFS is different. With the default value (space, tab, newline) the output should be
    # 0000000   I   F   S   =   #      \t  \n   #  \n
    # 0000012
    echo "IFS=#${IFS}#" | od -c
    

    If you see a difference in the value of IFS, you have to find out where IFS is modified.
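
    As a defensive measure (my suggestion, not part of the original answer), you can also reset IFS to its default value right before the loop, so that an earlier modification elsewhere in the script can no longer affect the word splitting:

    # Restore the default word-splitting characters (space, tab, newline)
    IFS=$' \t\n'      # alternatively: unset IFS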

    BTW: Normally you can omit | tr '\n' ' ' after the grep command. The shell should accept \n as a word-splitting character when processing for f in $files. If it doesn't, this is probably related to the source of your problem.
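
    A quick sketch (again mine, with placeholder data) showing that newline-separated words are split just as well under the default IFS, which is why the tr step is redundant:

    # Newlines in the list are split exactly like spaces under the default IFS
    files=$(printf 'foo\nbar\nbaz')
    for f in $files
    do
        echo "available files are: $f"   # three separate lines, as on Dev
    done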

    Edit: There is a better way to process the data line by line, see
    https://mywiki.wooledge.org/DontReadLinesWithFor and
    https://mywiki.wooledge.org/BashFAQ/001

    You should use a while read ... loop instead of the for loop.

    Modified script (untested):

    output_path=s3://output
    unziped_dir=s3://2019-01-03

    # Read the file list line by line; no word splitting of a single variable needed
    hadoop fs -ls "$output_path"/ | awk '{print $NF}' | grep '\.gz$' | while IFS= read -r f
    do
        echo "available files are: $f"
        # Extract the bare file name (the last path component)
        filename=$(hadoop fs -ls "$f" | awk -F '/' '{print $NF}' | head -1)
        # Decompress on the fly and store the result without the .gz suffix
        hdfs dfs -cat "$f" | gzip -d | hdfs dfs -put - "${unziped_dir}/${filename%.*}"
        echo "unziped file names: ${filename%.*}"
    done
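
    Note that the while loop now runs at the end of a pipeline, so in most shells it executes in a subshell: variables set inside the loop (such as filename) are not visible once the loop ends. That does not matter here, because everything that uses them happens inside the loop body.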