I am looking through a Hadoop file system. The command
hadoop fs -ls /path/to/dir1*
lists every directory that starts with dir1 and returns their files.
The output will be something like
Found 1 items
-rw-r--r-- 3 sys_blah_blah moredate /path/to/dir10/file1.py
Found 1 items
-rw-r--r-- 3 sys_blah_blah moredate /path/to/dir10/file2.py
Found 1 items
-rw-r--r-- 3 sys_blah_blah moredate /path/to/dir10/file3.py
Found 1 items
-rw-r--r-- 3 sys_blah_blah moredate /path/to/dir11/file1.py
Found 1 items
-rw-r--r-- 3 sys_blah_blah moredate /path/to/dir11/file2.py
...
The only piece of information I am interested in is the file path portion. How can I store only the paths into another file? Ideally, the result would be a file containing nothing but those paths.
Initially, I thought about running the command, storing its output in a file, then parsing that file with a regex to grab the paths and writing them into yet another file, but that seems unnecessary.
You can make use of grep here, since hadoop fs -ls hardly provides any formatting options of its own. The path is the last whitespace-free field on each line, so match from a / through to the end of the line (the "Found 1 items" header lines contain no /, so grep skips them automatically):
hadoop fs -ls /path/to/dir1* | grep -oE "/[^ ]+$" > outFile.dat
If you only need the containing directories rather than the full file paths, match up to the last / instead and pass the result to uniq (the listing groups files by directory, so duplicates are adjacent):
hadoop fs -ls /path/to/dir1* | grep -oE "/[^ ]*/" | uniq > outFile.dat
Looks pretty simple.