hadoop cascading scalding sequencefile

Outputting a Scalding TypedPipe to a SequenceFile in multiple directories based on one of the fields


I'm using Scalding on Hadoop. I have a large dataset in the form of a TypedPipe that I want to write out in chunks based on one of the data fields.

For example, the data is <category, field1, field2>, and I want the data for each category stored in a SequenceFile in a separate directory, e.g. outPath/cat1, outPath/cat2, etc. I also want this to happen in a single MapReduce phase (or at least without a loop).

I have read about the TemplatedTsv option here: How to bucket outputs in Scalding

And here: How to output data with Hive-style directory structure in Scalding?

However, this works only if you want a Tsv file, not a SequenceFile.

Obviously a loop works:

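// Each iteration adds a separate filter-and-write branch to the flow, one per category.
for (category <- categories) {
  data
    .filter(_.category == category)
    // Serialize each record into a BytesWritable value with a null key.
    .map(t => (NullWritable.get, new BytesWritable(SerializationUtils.serialize(t))))
    .write(WritableSequenceFile(outPath + "/" + category))
}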

So is there an equivalent to TemplatedTsv that writes a SequenceFile and avoids the loop?


Solution

  • There is com.twitter.scalding.TemplatedSequenceFile, which may do what you need. It looks just like TemplatedTsv but writes its output to a SequenceFile.
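A minimal sketch of how that might look, under several assumptions: the Record case class, the "input" and "output" argument names, and the way data is built are all hypothetical, and the constructor argument order (base path, template, fields to write, fields used to fill the template) is as I recall it from Scalding's TemplateSource, so check it against your Scalding version. Also note that, as far as I can tell, TemplatedSequenceFile writes Cascading's own tuple SequenceFile format rather than the NullWritable/BytesWritable pairs used in the loop above.

```scala
import cascading.tuple.Fields
import com.twitter.scalding._

// Hypothetical record type standing in for <category, field1, field2>.
case class Record(category: String, field1: String, field2: String)

class BucketByCategoryJob(args: Args) extends Job(args) {
  // Hypothetical argument names.
  val outPath = args("output")

  // Placeholder input; replace with however `data` is really produced.
  val data: TypedPipe[Record] =
    TypedPipe
      .from(TypedTsv[(String, String, String)](args("input")))
      .map { case (c, f1, f2) => Record(c, f1, f2) }

  // Fields written to each file, and the field that selects the directory.
  val allFields  = new Fields("category", "field1", "field2")
  val pathFields = new Fields("category")

  data
    .map(r => (r.category, r.field1, r.field2))
    // Drop to the fields API so the template source can see named fields.
    .toPipe(allFields)
    // "%s" is filled with the category value, giving outPath/<category>/...
    .write(TemplatedSequenceFile(outPath, "%s", allFields, pathFields))
}
```

Because the bucketing happens inside a single sink, there is no per-category filter and no loop; each record is routed to its directory as it is written.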