output_path=s3://output
unziped_dir=s3://2019-01-03
files=`hadoop fs -ls $output_path/ | awk '{print $NF}' | grep .gz$ | tr '\n' ' '`;
for f in $files
do
    echo "available files are: $f"
    filename=$(hadoop fs -ls $f | awk -F '/' '{print $NF}' | head -1)
    hdfs dfs -cat $f | gzip -d | hdfs dfs -put - $unziped_dir"/"${filename%.*}
    echo "unziped file names: ${filename%.*}"
done
Output:
Dev:
available files are: s3://2019-01-03/File_2019-01-03.CSV.gz
unziped file names: File_2019-01-03.CSV
available files are: s3://2019-01-03/Data_2019-01-03.CSV.gz
unziped file names: Data_2019-01-03.CSV
available files are: s3://2019-01-03/Output_2019-01-03.CSV.gz
unziped file names: Output_2019-01-03.CSV
Prod:
available files are: s3://2019-01-03/File_2019-01-03.CSV.gz s3://2019-01-03/Data_2019-01-03.CSV.gz s3://2019-01-03/Output_2019-01-03.CSV.gz
unziped file names:
I am trying to look into a directory, identify the .gz files, iterate over them to unzip all of them, and store the results in a different directory. When I run this script on the EMR dev cluster it works fine, but on the prod cluster it does not. Please find the behavior of the script above.
There seems to be a problem with the word splitting in for f in $files. Normally the shell should split the value of $files at the spaces, as it does on Dev. On Dev, f is set to one of the three words from $files in every cycle of the for loop; on Prod, f gets the complete value of $files including spaces.
Do you set the variable IFS somewhere?
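As a hypothetical illustration of that suspicion: if IFS had been reassigned somewhere (here simulated with IFS=$'\n', a bash construct) the Prod behavior can be reproduced locally, and restoring the default brings back the Dev behavior.

```shell
files="foo bar baz"

IFS=$'\n'            # simulate a modified IFS that no longer contains a space
for f in $files
do
    echo "available files are: $f"   # one cycle: the whole list as a single word
done

unset IFS            # restore the default word splitting (space, tab, newline)
for f in $files
do
    echo "available files are: $f"   # three cycles: one word each
done
```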
If the problem is not in other parts of your script you should be able to reproduce the problem with a reduced script:
files="foo bar baz"
for f in $files
do
    echo "available files are: $f"
done
If this minimal script doesn't show a difference the problem is in other parts of your script.
To see if the value of IFS is different on Dev and Prod, you can add this to the minimal script, or to your original script just before the for loop:
# To see if IFS is different. With the default value (space, tab, newline) the output should be
# 0000000 I F S = # \t \n # \n
# 0000012
echo "IFS=#${IFS}#" | od -c
If you see a difference in the value of IFS, you have to find out where IFS is modified.
BTW: Normally you can omit | tr '\n' ' ' after the grep command. The shell should accept \n as a word splitting character when processing for f in $files. If not, this is probably related to the source of your problem.
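A minimal sketch of that point, assuming the default IFS: a newline-separated list is split into words just as a space-separated one would be, so the tr step adds nothing.

```shell
# Default IFS contains space, tab and newline, so no tr '\n' ' ' is needed.
files="foo
bar
baz"

for f in $files
do
    echo "available files are: $f"   # three cycles, one word each
done
```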
Edit: There is a better solution to process the data line by line, see
https://mywiki.wooledge.org/DontReadLinesWithFor and
https://mywiki.wooledge.org/BashFAQ/001
You should use a while read ... loop instead of a for ... loop.
Modified script (untested)
output_path=s3://output
unziped_dir=s3://2019-01-03
hadoop fs -ls "$output_path"/ | awk '{print $NF}' | grep '\.gz$' | while IFS= read -r f
do
    echo "available files are: $f"
    filename=$(hadoop fs -ls "$f" | awk -F '/' '{print $NF}' | head -1)
    hdfs dfs -cat "$f" | gzip -d | hdfs dfs -put - "${unziped_dir}/${filename%.*}"
    echo "unziped file names: ${filename%.*}"
done
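A side benefit of the while IFS= read -r form, sketched locally below (printf stands in for the hadoop/awk/grep pipeline, and the file names are made up): each line is kept intact even if it contains spaces, whereas for f in $files would split such a name into several words.

```shell
# Each read consumes exactly one line; IFS= and -r keep it verbatim.
printf '%s\n' "File 2019-01-03.CSV.gz" "Data.CSV.gz" | while IFS= read -r f
do
    echo "available files are: $f"
done
```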