java hadoop cascading scalding

Hadoop-Cascading: Partial directory source tap


My data is laid out like this:

+data
|-2014080700_00.txt
|-2014080700_01.txt
|-2014080701_00.txt
|- ...
|-2014080723_00.txt
|-2014080800_00.txt
|- ...
|-2014090800_00.txt

I know I can read every file in the data directory with a Tap like this:

Tap inTap = new Hfs( new TextLine(), "/path/to/data"); 

But I only want part of the directory, for example the files for the date 20140807, i.e. every file whose name starts with 20140807. Is there a way to do this with Cascading? Or with Scalding?


Solution

  • I don't think you can do it using Hfs, but it's possible using GlobHfs.

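    Try the following:

    Tap inTap = new GlobHfs( new TextLine(), "/path/to/data/*", new GlobFilter("20140807*"));
    

    This creates a globbing tap: the path pattern "/path/to/data/*" matches every file in the data directory, and the GlobFilter keeps only those whose names match the glob "20140807*". The filter is applied to the paths that the pattern matches, so the pattern has to reach the files themselves rather than stop at the directory. GlobFilter here is org.apache.hadoop.fs.GlobFilter, which implements Hadoop's PathFilter.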
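
To preview which files from the listing in the question a glob such as "20140807*" would select, here is a minimal sketch that uses only the JDK's java.nio glob matcher, not Hadoop or Cascading. GlobPreview and select are hypothetical names made up for this illustration. NIO glob syntax is close to Hadoop's but not guaranteed to be identical, so treat this as a way to sanity-check a pattern, not as the code GlobHfs actually runs.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: checks file names against a glob with the JDK matcher.
class GlobPreview {

    // Returns the names that match the glob, preserving input order.
    static List<String> select(String glob, List<String> fileNames) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> selected = new ArrayList<>();
        for (String name : fileNames) {
            // Each name is a single path component, so '*' can span all of it.
            if (matcher.matches(Paths.get(name))) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // File names taken from the directory listing in the question.
        List<String> names = Arrays.asList(
            "2014080700_00.txt", "2014080700_01.txt", "2014080701_00.txt",
            "2014080723_00.txt", "2014080800_00.txt", "2014090800_00.txt");
        // Prints only the four names that start with 20140807.
        System.out.println(select("20140807*", names));
    }
}
```

On the Cascading side, GlobHfs also has a two-argument constructor that takes the glob as the path itself, so something like new GlobHfs( new TextLine(), "/path/to/data/20140807*") should select the same files without a separate filter.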